Enriching image descriptions by fusing fine-grained semantic features with a transformer
-
Abstract: Conventional image captioning models follow an encoder-decoder architecture that pairs a convolutional neural network (CNN) encoder with a recurrent neural network (RNN) decoder; such models discard a large amount of the detailed information contained in images and are costly to train. In this paper, we propose a novel model, consisting of a compact bilinear encoder and a compact multi-modal decoder, that improves image captioning with fine-grained regional object features. In the encoder, compact bilinear pooling (CBP) encodes fine-grained semantic features from the image's regional features, a multi-layer transformer encodes global semantic features from the image's bottom-up features, and the two sets of encoded features are fused through a gate structure to form the overall encoded representation of the image. In the decoder, multi-modal features are extracted from the fine-grained regional object features and the object category features, then fused with the overall encoded features to decode semantic information for caption generation. Extensive experiments on the public Microsoft COCO dataset show that our model achieves state-of-the-art image captioning performance.
Key words:
- image captioning /
- fine-grained features /
- multi-modal features /
- transformer
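CBP approximates full bilinear (outer-product) pooling with the Tensor Sketch projection of Gao et al. [23]: each feature vector is compressed with a Count Sketch [26], and the sketch of the outer product is then obtained as a circular convolution computed via FFT [27]. The following is a minimal NumPy sketch of that projection only (dimension names and sizes are illustrative, not the paper's settings):

```python
import numpy as np

def count_sketch_params(d, D, rng):
    """Random hash h: [d] -> [D] and signs s: [d] -> {-1, +1} for Count Sketch."""
    h = rng.integers(0, D, size=d)
    s = rng.choice([-1.0, 1.0], size=d)
    return h, s

def count_sketch(x, h, s, D):
    """Project a length-d vector x into a length-D Count Sketch."""
    y = np.zeros(D)
    np.add.at(y, h, s * x)  # accumulate signed entries into hashed buckets
    return y

def compact_bilinear_pooling(x, y, D=512, seed=0):
    """Tensor Sketch approximation of the outer product of x and y:
    sketch both vectors, then circularly convolve the sketches via FFT."""
    rng = np.random.default_rng(seed)
    h1, s1 = count_sketch_params(len(x), D, rng)
    h2, s2 = count_sketch_params(len(y), D, rng)
    sx = count_sketch(x, h1, s1, D)
    sy = count_sketch(y, h2, s2, D)
    # circular convolution = inverse FFT of the element-wise FFT product
    return np.real(np.fft.ifft(np.fft.fft(sx) * np.fft.fft(sy)))
```

The output is bilinear in its inputs (scaling either feature scales the pooled vector), which is the property the full outer product has, at a fraction of its dimensionality.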
Table 1 The effect of introducing gated feature fusion and compact bilinear pooling (CBP)

| Component | Model variant | B-1/% | B-4/% | M/% | R/% | C/% |
| --- | --- | --- | --- | --- | --- | --- |
| Gate structure | Transformer (without gate) | 74.6 | 35.1 | 26.6 | 55.3 | 112.3 |
| Gate structure | Ml-Transformer (gate unit only) | 76.0 | 36.0 | 27.8 | 56.4 | 113.8 |
| CBP layer | Ml-Transformer (features only) | 76.7 | 36.3 | 27.5 | 56.6 | 115.1 |
| CBP layer | Ml-Transformer (features + CBP layer) | 77.0 | 36.9 | 28.0 | 57.3 | 116.5 |
Table 2 The effect of extracting initial object features with various extractors

| View | Feature extractor | B-1/% | B-4/% | M/% | R/% | C/% |
| --- | --- | --- | --- | --- | --- | --- |
| Single view | R-101 | 77.0 | 36.9 | 28.0 | 57.3 | 116.5 |
| Single view | X-101 | 77.2 | 36.6 | 27.8 | 57.0 | 115.9 |
| Single view | D-121 | 77.1 | 37.0 | 27.6 | 57.0 | 116.2 |
| Multi-view | R-101 and X-101 | 77.5 | 37.2 | 27.4 | 56.8 | 115.5 |
| Multi-view | X-101 and D-121 | 77.4 | 37.2 | 27.5 | 56.7 | 115.4 |
Table 3 The effect of various settings for the hyper-parameter m

| m | B-1/% | B-4/% | M/% | R/% | C/% |
| --- | --- | --- | --- | --- | --- |
| 1 | 76.3 | 36.2 | 27.5 | 56.2 | 115.4 |
| 3 | 76.5 | 36.3 | 27.6 | 56.4 | 115.8 |
| 5 | 77.0 | 36.9 | 28.0 | 57.3 | 116.5 |
| 7 | 76.9 | 36.7 | 27.9 | 57.3 | 116.3 |
Table 4 Overall image captioning performance for all models (columns 2-6: cross-entropy loss training; columns 7-11: CIDEr-D optimization; "-" marks values not reported)

| Model | B-1/% | B-4/% | M/% | R/% | C/% | B-1/% | B-4/% | M/% | R/% | C/% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SCST | - | 30.0 | 25.9 | 53.4 | 99.4 | - | 34.2 | 26.7 | 55.7 | 114.0 |
| LSTM-A | 75.4 | 35.2 | 26.9 | 55.8 | 108.8 | 78.6 | 35.5 | 27.3 | 56.8 | 118.3 |
| Up-Down | 77.2 | 36.2 | 27.0 | 56.4 | 113.5 | 79.8 | 36.3 | 27.7 | 56.9 | 120.1 |
| RFNet | 76.4 | 35.8 | 27.4 | 56.8 | 112.5 | 79.1 | 36.5 | 27.7 | 57.3 | 121.9 |
| GCN-LSTM | 77.3 | 36.8 | 27.9 | 57.0 | 116.3 | 80.5 | 38.2 | 28.5 | 58.3 | 127.6 |
| SGAE | - | - | - | - | - | 80.8 | 38.4 | 28.4 | 58.6 | 127.8 |
| Ml-Transformer (ours) | 77.0 | 36.9 | 28.0 | 57.3 | 116.5 | 80.9 | 39.0 | 28.9 | 58.5 | 129.1 |
Table 5 Model training efficiencies for various kinds of encoders/decoders

| Model | Time per epoch/h | Epochs |
| --- | --- | --- |
| SCST | 1.2 | 40 |
| Up-Down | 0.85 | 40 |
| Ml-Transformer (ours) | 0.85 | 15 |
[1] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 652-663.
[2] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. New York: Curran Associates Inc., 2017: 6000-6010.
[3] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition [C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016: 770-778.
[4] KULKARNI G, PREMRAJ V, ORDONEZ V, et al. BabyTalk: Understanding and generating simple image descriptions [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(12): 2891-2903.
[5] MITCHELL M, HAN X F, DODGE J, et al. Midge: Generating image descriptions from computer vision detections [C]//Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. ACM, 2012: 747-756.
[6] YANG Y Z, TEO C L, DAUMÉ H, et al. Corpus-guided sentence generation of natural images [C]//Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. ACM, 2011: 444-454.
[7] DEVLIN J, CHENG H, FANG H, et al. Language models for image captioning: The quirks and what works [EB/OL]. (2015-10-14)[2020-06-30]. https://arxiv.org/pdf/1505.01809.pdf.
[8] FARHADI A, HEJRATI M, SADEGHI M A, et al. Every picture tells a story: Generating sentences from images [C]//Computer Vision – ECCV 2010, Lecture Notes in Computer Science, vol 6314. Berlin: Springer, 2010: 15-29.
[9] KARPATHY A, JOULIN A, LI F F. Deep fragment embeddings for bidirectional image sentence mapping [C]//Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. Cambridge, MA: MIT Press, 2014: 1889-1897.
[10] MAO J H, XU W, YANG Y, et al. Explain images with multimodal recurrent neural networks [EB/OL]. (2014-10-04)[2020-06-30]. https://arxiv.org/pdf/1410.1090.pdf.
[11] LU J S, XIONG C M, PARIKH D, et al. Knowing when to look: Adaptive attention via a visual sentinel for image captioning [C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017: 3242-3250.
[12] YAO T, PAN Y W, LI Y H, et al. Exploring visual relationship for image captioning [C]//Computer Vision – ECCV 2018, Lecture Notes in Computer Science, vol 11218. Cham: Springer, 2018: 711-727.
[13] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: A neural image caption generator [C]//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015: 3156-3164.
[14] SZEGEDY C, LIU W, JIA Y Q, et al. Going deeper with convolutions [C]//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015: 1-9.
[15] HOCHREITER S, SCHMIDHUBER J. Long short-term memory [J]. Neural Computation, 1997, 9(8): 1735-1780.
[16] CHO K, VAN MERRIËNBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation [EB/OL]. (2014-09-03)[2020-06-30]. https://arxiv.org/pdf/1406.1078.pdf.
[17] KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 664-676.
[18] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition [EB/OL]. (2015-04-10)[2020-06-30]. https://arxiv.org/pdf/1409.1556.pdf.
[19] XU K, BA J L, KIROS R, et al. Show, attend and tell: Neural image caption generation with visual attention [EB/OL]. (2016-04-19)[2020-06-30]. https://arxiv.org/pdf/1502.03044.pdf.
[20] ANDERSON P, HE X D, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering [C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2018: 6077-6086.
[21] REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks [C]//Advances in Neural Information Processing Systems 28 (NIPS 2015). [S.l.]: Curran Associates, Inc., 2015: 91-99.
[22] LIN T Y, ROYCHOWDHURY A, MAJI S. Bilinear CNN models for fine-grained visual recognition [C]//2015 IEEE International Conference on Computer Vision (ICCV). IEEE, 2015: 1449-1457.
[23] GAO Y, BEIJBOM O, ZHANG N, et al. Compact bilinear pooling [C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016: 317-326.
[24] KONG S, FOWLKES C. Low-rank bilinear pooling for fine-grained classification [C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017: 7025-7034.
[25] WEI X, ZHANG Y, GONG Y H, et al. Grassmann pooling as compact homogeneous bilinear pooling for fine-grained visual classification [C]//Computer Vision – ECCV 2018, Lecture Notes in Computer Science, vol 11207. Cham: Springer, 2018: 365-380.
[26] CHARIKAR M, CHEN K, FARACH-COLTON M. Finding frequent items in data streams [C]//Automata, Languages and Programming, ICALP 2002, Lecture Notes in Computer Science, vol 2380. Berlin: Springer, 2002: 693-703.
[27] PHAM N, PAGH R. Fast and scalable polynomial kernels via explicit feature maps [C]//Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2013: 239-247.
[28] BA J L, KIROS J R, HINTON G E. Layer normalization [EB/OL]. (2016-07-21)[2020-06-30]. https://arxiv.org/pdf/1607.06450.pdf.
[29] RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning [C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017: 1179-1195.
[30] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: Common objects in context [C]//Computer Vision – ECCV 2014, Lecture Notes in Computer Science, vol 8693. Cham: Springer, 2014: 740-755.
[31] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding [EB/OL]. (2019-05-24)[2020-06-30]. https://arxiv.org/pdf/1810.04805.pdf.
[32] PAPINENI K, ROUKOS S, WARD T, et al. BLEU: A method for automatic evaluation of machine translation [C]//Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics (ACL), 2002: 311-318.
[33] BANERJEE S, LAVIE A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments [C]//Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Stroudsburg, PA: Association for Computational Linguistics (ACL), 2005: 65-72.
[34] LIN C Y. ROUGE: A package for automatic evaluation of summaries [C]//Proceedings of the Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL. Stroudsburg, PA: Association for Computational Linguistics (ACL), 2004: 74-81.
[35] VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: Consensus-based image description evaluation [C]//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015: 4566-4575.
[36] YAO T, PAN Y W, LI Y H, et al. Boosting image captioning with attributes [C]//2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 2017: 4904-4912.
[37] FANG H, GUPTA S, IANDOLA F, et al. From captions to visual concepts and back [C]//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015: 1473-1482.
[38] JIANG W H, MA L, JIANG Y G, et al. Recurrent fusion network for image captioning [C]//Computer Vision – ECCV 2018, Lecture Notes in Computer Science, vol 11206. Cham: Springer, 2018: 510-526.
[39] YAO T, PAN Y W, LI Y H, et al. Exploring visual relationship for image captioning [C]//Computer Vision – ECCV 2018, Lecture Notes in Computer Science, vol 11218. Cham: Springer, 2018: 711-727.
[40] YANG X, TANG K H, ZHANG H W, et al. Auto-encoding scene graphs for image captioning [C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019: 10677-10686.