Enriching image descriptions by fusing fine-grained semantic features with a transformer
-
Abstract: Conventional image captioning models follow an encoder-decoder architecture that pairs a convolutional neural network (CNN) encoder with a recurrent neural network (RNN) decoder; such models discard a large amount of the detailed information contained in images and are costly to train. In this paper, we propose a novel model, consisting of a compact bilinear encoder and a compact multi-modal decoder, that improves image captioning with fine-grained regional object features. In the encoder, compact bilinear pooling (CBP) encodes fine-grained semantic features from the image's regional features, a multi-layer transformer encodes global semantic features from the image's bottom-up features, and the two sets of encoded features are fused through a gate structure to form the overall encoded representation of the image. In the decoder, multi-modal features are extracted from the fine-grained regional object features and the object category features, then fused with the overall encoded features to decode semantic information for caption generation. Extensive experiments on the public Microsoft COCO dataset show that our model achieves state-of-the-art image captioning performance.
Key words:
- image captioning /
- fine-grained features /
- multi-modal features /
- transformer
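CBP approximates full bilinear (outer-product) pooling with the Tensor Sketch projection of Gao et al. [23]: each feature vector is compressed with a Count Sketch [26], and the sketch of the outer product is then obtained as a circular convolution computed via FFT [27]. The following is a minimal NumPy sketch of that projection only (dimension names and sizes are illustrative, not the paper's settings):

```python
import numpy as np

def count_sketch_params(d, D, rng):
    """Random hash h: [d] -> [D] and signs s: [d] -> {-1, +1} for Count Sketch."""
    h = rng.integers(0, D, size=d)
    s = rng.choice([-1.0, 1.0], size=d)
    return h, s

def count_sketch(x, h, s, D):
    """Project a length-d vector x into a length-D Count Sketch."""
    y = np.zeros(D)
    np.add.at(y, h, s * x)  # accumulate signed entries into hashed buckets
    return y

def compact_bilinear_pooling(x, y, D=512, seed=0):
    """Tensor Sketch approximation of the outer product of x and y:
    sketch both vectors, then circularly convolve the sketches via FFT."""
    rng = np.random.default_rng(seed)
    h1, s1 = count_sketch_params(len(x), D, rng)
    h2, s2 = count_sketch_params(len(y), D, rng)
    sx = count_sketch(x, h1, s1, D)
    sy = count_sketch(y, h2, s2, D)
    # circular convolution = inverse FFT of the element-wise FFT product
    return np.real(np.fft.ifft(np.fft.fft(sx) * np.fft.fft(sy)))
```

The output is bilinear in its inputs (scaling either feature scales the pooled vector), which is the property the full outer product has, at a fraction of its dimensionality.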
Table 1 The effect of introducing gated feature fusion and compact bilinear pooling (CBP)

| Component | Model variant | B-1/% | B-4/% | M/% | R/% | C/% |
| --- | --- | --- | --- | --- | --- | --- |
| Gate structure | Transformer (without gate) | 74.6 | 35.1 | 26.6 | 55.3 | 112.3 |
| Gate structure | Ml-Transformer (gate unit only) | 76.0 | 36.0 | 27.8 | 56.4 | 113.8 |
| CBP layer | Ml-Transformer (features only) | 76.7 | 36.3 | 27.5 | 56.6 | 115.1 |
| CBP layer | Ml-Transformer (features + CBP layer) | 77.0 | 36.9 | 28.0 | 57.3 | 116.5 |
Table 2 The effect of extracting initial object features with various extractors

| View | Feature extractor | B-1/% | B-4/% | M/% | R/% | C/% |
| --- | --- | --- | --- | --- | --- | --- |
| Single view | R-101 | 77.0 | 36.9 | 28.0 | 57.3 | 116.5 |
| Single view | X-101 | 77.2 | 36.6 | 27.8 | 57.0 | 115.9 |
| Single view | D-121 | 77.1 | 37.0 | 27.6 | 57.0 | 116.2 |
| Multi-view | R-101 and X-101 | 77.5 | 37.2 | 27.4 | 56.8 | 115.5 |
| Multi-view | X-101 and D-121 | 77.4 | 37.2 | 27.5 | 56.7 | 115.4 |
Table 3 The effect of various settings for the hyper-parameter m

| m | B-1/% | B-4/% | M/% | R/% | C/% |
| --- | --- | --- | --- | --- | --- |
| 1 | 76.3 | 36.2 | 27.5 | 56.2 | 115.4 |
| 3 | 76.5 | 36.3 | 27.6 | 56.4 | 115.8 |
| 5 | 77.0 | 36.9 | 28.0 | 57.3 | 116.5 |
| 7 | 76.9 | 36.7 | 27.9 | 57.3 | 116.3 |
Table 4 Overall image captioning performance for all models (columns 2-6: cross-entropy loss training; columns 7-11: CIDEr-D optimization; "-" marks values not reported)

| Model | B-1/% | B-4/% | M/% | R/% | C/% | B-1/% | B-4/% | M/% | R/% | C/% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SCST | - | 30.0 | 25.9 | 53.4 | 99.4 | - | 34.2 | 26.7 | 55.7 | 114.0 |
| LSTM-A | 75.4 | 35.2 | 26.9 | 55.8 | 108.8 | 78.6 | 35.5 | 27.3 | 56.8 | 118.3 |
| Up-Down | 77.2 | 36.2 | 27.0 | 56.4 | 113.5 | 79.8 | 36.3 | 27.7 | 56.9 | 120.1 |
| RFNet | 76.4 | 35.8 | 27.4 | 56.8 | 112.5 | 79.1 | 36.5 | 27.7 | 57.3 | 121.9 |
| GCN-LSTM | 77.3 | 36.8 | 27.9 | 57.0 | 116.3 | 80.5 | 38.2 | 28.5 | 58.3 | 127.6 |
| SGAE | - | - | - | - | - | 80.8 | 38.4 | 28.4 | 58.6 | 127.8 |
| Ml-Transformer (ours) | 77.0 | 36.9 | 28.0 | 57.3 | 116.5 | 80.9 | 39.0 | 28.9 | 58.5 | 129.1 |
Table 5 Model training efficiencies for various kinds of encoders/decoders

| Model | Time per epoch/h | Epochs |
| --- | --- | --- |
| SCST | 1.2 | 40 |
| Up-Down | 0.85 | 40 |
| Ml-Transformer (ours) | 0.85 | 15 |
[1] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 652-663.
[2] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. New York: Curran Associates Inc., 2017: 6000-6010.
[3] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition [C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016: 770-778.
[4] KULKARNI G, PREMRAJ V, ORDONEZ V, et al. BabyTalk: Understanding and generating simple image descriptions [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(12): 2891-2903.
[5] MITCHELL M, HAN X F, DODGE J, et al. Midge: Generating image descriptions from computer vision detections [C]//Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. ACM, 2012: 747-756.
[6] YANG Y Z, TEO C L, DAUMÉ H, et al. Corpus-guided sentence generation of natural images [C]//Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. ACM, 2011: 444-454.
[7] DEVLIN J, CHENG H, FANG H, et al. Language models for image captioning: The quirks and what works [EB/OL]. (2015-10-14)[2020-06-30]. https://arxiv.org/pdf/1505.01809.pdf.
[8] FARHADI A, HEJRATI M, SADEGHI M A, et al. Every picture tells a story: Generating sentences from images [C]//Computer Vision – ECCV 2010, Lecture Notes in Computer Science, vol 6314. Berlin: Springer, 2010: 15-29.
[9] KARPATHY A, JOULIN A, LI F F. Deep fragment embeddings for bidirectional image sentence mapping [C]//Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. Cambridge, MA: MIT Press, 2014: 1889-1897.
[10] MAO J H, XU W, YANG Y, et al. Explain images with multimodal recurrent neural networks [EB/OL]. (2014-10-04)[2020-06-30]. https://arxiv.org/pdf/1410.1090.pdf.
[11] LU J S, XIONG C M, PARIKH D, et al. Knowing when to look: Adaptive attention via a visual sentinel for image captioning [C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017: 3242-3250.
[12] YAO T, PAN Y W, LI Y H, et al. Exploring visual relationship for image captioning [C]//Computer Vision – ECCV 2018, Lecture Notes in Computer Science, vol 11218. Cham: Springer, 2018: 711-727.
[13] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: A neural image caption generator [C]//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015: 3156-3164.
[14] SZEGEDY C, LIU W, JIA Y Q, et al. Going deeper with convolutions [C]//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015: 1-9.
[15] HOCHREITER S, SCHMIDHUBER J. Long short-term memory [J]. Neural Computation, 1997, 9(8): 1735-1780.
[16] CHO K, VAN MERRIËNBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation [EB/OL]. (2014-09-03)[2020-06-30]. https://arxiv.org/pdf/1406.1078.pdf.
[17] KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 664-676.
[18] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition [EB/OL]. (2015-04-10)[2020-06-30]. https://arxiv.org/pdf/1409.1556.pdf.
[19] XU K, BA J L, KIROS R, et al. Show, attend and tell: Neural image caption generation with visual attention [EB/OL]. (2016-04-19)[2020-06-30]. https://arxiv.org/pdf/1502.03044.pdf.
[20] ANDERSON P, HE X D, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering [C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2018: 6077-6086.
[21] REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks [C]//Advances in Neural Information Processing Systems 28 (NIPS 2015). [S.l.]: Curran Associates, Inc., 2015: 91-99.
[22] LIN T Y, ROYCHOWDHURY A, MAJI S. Bilinear CNN models for fine-grained visual recognition [C]//2015 IEEE International Conference on Computer Vision (ICCV). IEEE, 2015: 1449-1457.
[23] GAO Y, BEIJBOM O, ZHANG N, et al. Compact bilinear pooling [C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016: 317-326.
[24] KONG S, FOWLKES C. Low-rank bilinear pooling for fine-grained classification [C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017: 7025-7034.
[25] WEI X, ZHANG Y, GONG Y H, et al. Grassmann pooling as compact homogeneous bilinear pooling for fine-grained visual classification [C]//Computer Vision – ECCV 2018, Lecture Notes in Computer Science, vol 11207. Cham: Springer, 2018: 365-380.
[26] CHARIKAR M, CHEN K, FARACH-COLTON M. Finding frequent items in data streams [C]//Automata, Languages and Programming, ICALP 2002, Lecture Notes in Computer Science, vol 2380. Berlin: Springer, 2002: 693-703.
[27] PHAM N, PAGH R. Fast and scalable polynomial kernels via explicit feature maps [C]//Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2013: 239-247.
[28] BA J L, KIROS J R, HINTON G E. Layer normalization [EB/OL]. (2016-07-21)[2020-06-30]. https://arxiv.org/pdf/1607.06450.pdf.
[29] RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning [C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017: 1179-1195.
[30] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: Common objects in context [C]//Computer Vision – ECCV 2014, Lecture Notes in Computer Science, vol 8693. Cham: Springer, 2014: 740-755.
[31] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding [EB/OL]. (2019-05-24)[2020-06-30]. https://arxiv.org/pdf/1810.04805.pdf.
[32] PAPINENI K, ROUKOS S, WARD T, et al. BLEU: A method for automatic evaluation of machine translation [C]//Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics (ACL), 2002: 311-318.
[33] BANERJEE S, LAVIE A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments [C]//Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Stroudsburg, PA: Association for Computational Linguistics (ACL), 2005: 65-72.
[34] LIN C Y. ROUGE: A package for automatic evaluation of summaries [C]//Proceedings of the Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL. Stroudsburg, PA: Association for Computational Linguistics (ACL), 2004: 74-81.
[35] VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: Consensus-based image description evaluation [C]//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015: 4566-4575.
[36] YAO T, PAN Y W, LI Y H, et al. Boosting image captioning with attributes [C]//2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 2017: 4904-4912.
[37] FANG H, GUPTA S, IANDOLA F, et al. From captions to visual concepts and back [C]//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015: 1473-1482.
[38] JIANG W H, MA L, JIANG Y G, et al. Recurrent fusion network for image captioning [C]//Computer Vision – ECCV 2018, Lecture Notes in Computer Science, vol 11206. Cham: Springer, 2018: 510-526.
[39] YAO T, PAN Y W, LI Y H, et al. Exploring visual relationship for image captioning [C]//Computer Vision – ECCV 2018, Lecture Notes in Computer Science, vol 11218. Cham: Springer, 2018: 711-727.
[40] YANG X, TANG K H, ZHANG H W, et al. Auto-encoding scene graphs for image captioning [C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019: 10677-10686.