Enriching image descriptions by fusing fine-grained semantic features with a transformer

WANG Junhao, LUO Yifeng

Citation: WANG Junhao, LUO Yifeng. Enriching image descriptions by fusing fine-grained semantic features with a transformer[J]. Journal of East China Normal University (Natural Sciences), 2020, (5): 56-67. doi: 10.3969/j.issn.1000-5641.202091004


doi: 10.3969/j.issn.1000-5641.202091004
Funding: National Key Research and Development Program of China (2018YFC0831904)
Corresponding author: LUO Yifeng, male, associate professor, master's supervisor; research interests: text data mining and knowledge graphs. E-mail: yifluo@dase.ecnu.edu.cn

  • CLC number: TP399


  • Abstract: Traditional image captioning models are typically built on an encoder-decoder architecture that couples a convolutional neural network (CNN) with a recurrent neural network (RNN); such models tend to lose a large amount of fine image detail and are costly to train. We propose a novel model, consisting of a compact bilinear encoder and a compact multi-modal decoder, that improves image captioning with fine-grained regional object features. In the encoder, compact bilinear pooling (CBP) encodes fine-grained semantic features of image regions, a multi-layer Transformer encodes the global semantic features of the image, and all encoded features are fused through a gate structure into the overall encoded representation of the image. In the decoder, multi-modal features are extracted from the fine-grained regional object features and the object category features, and are fused with the overall encoded features to decode semantic information and generate descriptions. Extensive experiments on the public Microsoft COCO dataset show that the proposed model achieves better image captioning performance than existing models.
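The abstract names two fusion mechanisms: compact bilinear pooling (CBP) over fine-grained region features, and a gate structure that merges them with the multi-layer Transformer's global features. The PyTorch sketch below is a minimal illustration of both ideas, not the authors' implementation; the class names, feature dimensions, and the 8192-dimensional sketch size are assumptions made for this example. The CBP part follows the Count Sketch construction of Gao et al. [23]: each input vector is hashed into a fixed-size sketch, and the circular convolution of the two sketches (an element-wise product in the FFT domain) approximates their outer-product interaction.

```python
# Minimal sketch of compact bilinear pooling (CBP) and gated feature fusion.
# Illustrative only: names, dimensions, and the sketch size are assumptions.
import torch
import torch.nn as nn


class CompactBilinearPooling(nn.Module):
    """Approximate the outer-product interaction of two feature vectors with
    random Count Sketch projections and an FFT-domain product."""

    def __init__(self, dim_x: int, dim_y: int, output_dim: int = 8192):
        super().__init__()
        self.output_dim = output_dim
        # Fixed random hash indices h and random signs s for each input stream.
        for name, dim in (("x", dim_x), ("y", dim_y)):
            self.register_buffer(f"h_{name}", torch.randint(output_dim, (dim,)))
            self.register_buffer(f"s_{name}", (2 * torch.randint(2, (dim,)) - 1).float())

    def _count_sketch(self, v, h, s):
        # Scatter signed features into sketch bins: (batch, dim) -> (batch, output_dim).
        sketch = v.new_zeros(v.size(0), self.output_dim)
        sketch.index_add_(1, h, v * s)
        return sketch

    def forward(self, x, y):
        # Circular convolution of the two sketches, computed in the FFT domain.
        fx = torch.fft.rfft(self._count_sketch(x, self.h_x, self.s_x), dim=-1)
        fy = torch.fft.rfft(self._count_sketch(y, self.h_y, self.s_y), dim=-1)
        return torch.fft.irfft(fx * fy, n=self.output_dim, dim=-1)


class GatedFusion(nn.Module):
    """Merge region-level features with global Transformer features through a
    learned sigmoid gate (both inputs assumed projected to the same size)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, region_feat, global_feat):
        g = torch.sigmoid(self.gate(torch.cat([region_feat, global_feat], dim=-1)))
        return g * region_feat + (1 - g) * global_feat
```

A gate value near 1 keeps the fine-grained region information, while a value near 0 falls back to the global Transformer features, so the fused encoding can adapt, per dimension, to how informative each stream is.
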
  • Fig. 1  An example of description generation using our model

    Fig. 2  Overview of the model structure

    Tab. 1  The effect of introducing gated feature fusion and compact bilinear pooling (B: BLEU; M: METEOR; R: ROUGE-L; C: CIDEr)

    | Added module   | Model                                        | B-1/% | B-4/% | M/%  | R/%  | C/%   |
    |----------------|----------------------------------------------|-------|-------|------|------|-------|
    | Gate structure | Transformer (neither added)                  | 74.6  | 35.1  | 26.6 | 55.3 | 112.3 |
    |                | Ml-Transformer (gate unit only)              | 76.0  | 36.0  | 27.8 | 56.4 | 113.8 |
    | CBP layer      | Ml-Transformer (features only)               | 76.7  | 36.3  | 27.5 | 56.6 | 115.1 |
    |                | Ml-Transformer (features with the CBP layer) | 77.0  | 36.9  | 28.0 | 57.3 | 116.5 |

    Tab. 2  The effect of extracting initial object features with various extractors

    | View        | Feature extractor | B-1/% | B-4/% | M/%  | R/%  | C/%   |
    |-------------|-------------------|-------|-------|------|------|-------|
    | Single view | R-101             | 77.0  | 36.9  | 28.0 | 57.3 | 116.5 |
    |             | X-101             | 77.2  | 36.6  | 27.8 | 57.0 | 115.9 |
    |             | D-121             | 77.1  | 37.0  | 27.6 | 57.0 | 116.2 |
    | Multi-view  | R-101 and X-101   | 77.5  | 37.2  | 27.4 | 56.8 | 115.5 |
    |             | X-101 and D-121   | 77.4  | 37.2  | 27.5 | 56.7 | 115.4 |

    Tab. 3  The effect of various settings for the hyper-parameter m

    | m | B-1/% | B-4/% | M/%  | R/%  | C/%   |
    |---|-------|-------|------|------|-------|
    | 1 | 76.3  | 36.2  | 27.5 | 56.2 | 115.4 |
    | 3 | 76.5  | 36.3  | 27.6 | 56.4 | 115.8 |
    | 5 | 77.0  | 36.9  | 28.0 | 57.3 | 116.5 |
    | 7 | 76.9  | 36.7  | 27.9 | 57.3 | 116.3 |

    Tab. 4  Overall image-captioning performance of all models, under cross-entropy (XE) training and after CIDEr-D optimization ("–" indicates values not reported)

    | Model                  | XE: B-1/% | B-4/% | M/%  | R/%  | C/%   | CIDEr-D: B-1/% | B-4/% | M/%  | R/%  | C/%   |
    |------------------------|-----------|-------|------|------|-------|----------------|-------|------|------|-------|
    | SCST                   | –         | 30.0  | 25.9 | 53.4 | 99.4  | –              | 34.2  | 26.7 | 55.7 | 114.0 |
    | LSTM-A                 | 75.4      | 35.2  | 26.9 | 55.8 | 108.8 | 78.6           | 35.5  | 27.3 | 56.8 | 118.3 |
    | Up-Down                | 77.2      | 36.2  | 27.0 | 56.4 | 113.5 | 79.8           | 36.3  | 27.7 | 56.9 | 120.1 |
    | RFNet                  | 76.4      | 35.8  | 27.4 | 56.8 | 112.5 | 79.1           | 36.5  | 27.7 | 57.3 | 121.9 |
    | GCN-LSTM               | 77.3      | 36.8  | 27.9 | 57.0 | 116.3 | 80.5           | 38.2  | 28.5 | 58.3 | 127.6 |
    | SGAE                   | –         | –     | –    | –    | –     | 80.8           | 38.4  | 28.4 | 58.6 | 127.8 |
    | Ml-Transformer (ours)  | 77.0      | 36.9  | 28.0 | 57.3 | 116.5 | 80.9           | 39.0  | 28.9 | 58.5 | 129.1 |

    Tab. 5  Model training efficiency for various kinds of encoders/decoders

    | Model                  | Time per epoch/h | Epochs |
    |------------------------|------------------|--------|
    | SCST                   | 1.2              | 40     |
    | Up-Down                | 0.85             | 40     |
    | Ml-Transformer (ours)  | 0.85             | 15     |
  • [1] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 652-663.
    [2] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. New York: Curran Associates Inc., 2017: 6000–6010.
    [3] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition [C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016: 770-778.
    [4] KULKARNI G, PREMRAJ V, ORDONEZ V, et al. Babytalk: Understanding and generating simple image descriptions [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(12): 2891-2903.
    [5] MITCHELL M, HAN X F, DODGE J, et al. Midge: Generating image descriptions from computer vision detections [C]//Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. ACM, 2012: 747-756.
    [6] YANG Y Z, TEO C L, DAUMÉ H, et al. Corpus-guided sentence generation of natural images [C]//Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. ACM, 2011: 444-454.
    [7] DEVLIN J, CHENG H, FANG H, et al. Language models for image captioning: The quirks and what works [EB/OL].(2015-10-14)[2020-06-30]. https://arxiv.org/pdf/1505.01809.pdf.
    [8] FARHADI A, HEJRATI M, SADEGHI M A, et al. Every picture tells a story: Generating sentences from images [C]//Computer Vision – ECCV 2010, Lecture Notes in Computer Science, vol 6314. Berlin: Springer, 2010: 15-29.
    [9] KARPATHY A, JOULIN A, LI F F. Deep fragment embeddings for bidirectional image sentence mapping [C]//Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. Cambridge, MA: MIT Press, 2014: 1889-1897.
    [10] MAO J H, XU W, YANG Y, et al. Explain images with multimodal recurrent neural networks [EB/OL]. (2014-10-04)[2020-06-30]. https://arxiv.org/pdf/1410.1090.pdf.
    [11] LU J S, XIONG C M, PARIKH D, et al. Knowing when to look: Adaptive attention via a visual sentinel for image captioning [C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017: 3242-3250.
    [12] YAO T, PAN Y W, LI Y H, et al. Exploring visual relationship for image captioning [C]//Computer Vision – ECCV 2018, Lecture Notes in Computer Science, vol 11218. Cham: Springer, 2018: 711-727.
    [13] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: A neural image caption generator [C]//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015: 3156-3164.
    [14] SZEGEDY C, LIU W, JIA Y Q, et al. Going deeper with convolutions [C]//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015: 1-9.
    [15] HOCHREITER S, SCHMIDHUBER J. Long short-term memory [J]. Neural Computation, 1997, 9(8): 1735-1780.
    [16] CHO K, VAN MERRIËNBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation [EB/OL].(2014-09-03)[2020-06-30]. https://arxiv.org/pdf/1406.1078.pdf.
    [17] KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 664-676.
    [18] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition [EB/OL]. (2015-04-10)[2020-06-30]. https://arxiv.org/pdf/1409.1556.pdf.
    [19] XU K, BA J L, KIROS R, et al. Show, attend and tell: Neural image caption generation with visual attention [EB/OL]. (2016-04-19)[2020-06-30]. https://arxiv.org/pdf/1502.03044.pdf.
    [20] ANDERSON P, HE X D, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering [C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018: 6077-6086.
    [21] REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks [C]//Advances in Neural Information Processing Systems 28 (NIPS 2015).[S.l.]: Curran Associates, Inc., 2015: 91-99.
    [22] LIN T Y, ROYCHOWDHURY A, MAJI S. Bilinear CNN models for fine-grained visual recognition [C]//2015 IEEE International Conference on Computer Vision (ICCV). IEEE, 2015: 1449-1457.
    [23] GAO Y, BEIJBOM O, ZHANG N, et al. Compact bilinear pooling [C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016: 317-326.
    [24] KONG S, FOWLKES C. Low-rank bilinear pooling for fine-grained classification [C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017: 7025-7034.
    [25] WEI X, ZHANG Y, GONG Y H, et al. Grassmann pooling as compact homogeneous bilinear pooling for fine-grained visual classification [C]//Computer Vision – ECCV 2018, Lecture Notes in Computer Science, vol 11207. Cham: Springer, 2018: 365-380.
    [26] CHARIKAR M, CHEN K, FARACH-COLTON M. Finding frequent items in data streams [C]//Automata, Languages and Programming, ICALP 2002, Lecture Notes in Computer Science, vol 2380. Berlin : Springer, 2002: 693-703.
    [27] PHAM N, PAGH R. Fast and scalable polynomial kernels via explicit feature maps [C]//Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2013: 239-247.
    [28] BA J L, KIROS J R, HINTON G E. Layer normalization [EB/OL]. (2016-07-21)[2020-06-30]. https://arxiv.org/pdf/1607.06450.pdf.
    [29] RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning [C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017: 1179-1195.
    [30] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: Common objects in context [C]//Computer Vision – ECCV 2014, Lecture Notes in Computer Science, vol 8693. Cham: Springer, 2014: 740-755.
    [31] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding [EB/OL]. (2019-05-24)[2020-06-30]. https://arxiv.org/pdf/1810.04805.pdf.
    [32] PAPINENI K, ROUKOS S, WARD T, et al. BLEU: A method for automatic evaluation of machine translation [C]//Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics(ACL), 2002: 311-318.
    [33] BANERJEE S, LAVIE A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments [C]//Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Stroudsburg, PA: Association for Computational Linguistics(ACL), 2005: 65-72.
    [34] LIN C Y. Rouge: A package for automatic evaluation of summaries [C]//Proceedings of the Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL. Stroudsburg, PA: Association for Computational Linguistics(ACL), 2004: 74-81.
    [35] VEDANTAM R, ZITNICK C L, PARIKH D. Cider: Consensus-based image description evaluation [C]//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015: 4566-4575.
    [36] YAO T, PAN Y W, LI Y H, et al. Boosting image captioning with attributes [C]//2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 2017: 4904-4912.
    [37] FANG H, GUPTA S, IANDOLA F, et al. From captions to visual concepts and back [C]//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015: 1473-1482.
    [38] JIANG W H, MA L, JIANG Y G, et al. Recurrent fusion network for image captioning [C]//Computer Vision – ECCV 2018, Lecture Notes in Computer Science, vol 11206. Cham: Springer, 2018: 510-526.
    [39] YAO T, PAN Y W, LI Y H, et al. Exploring visual relationship for image captioning [C]//Computer Vision – ECCV 2018, Lecture Notes in Computer Science, vol 11218. Cham: Springer, 2018: 711-727.
    [40] YANG X, TANG K H, ZHANG H W, et al. Auto-encoding scene graphs for image captioning [C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019: 10677-10686.
Article history
  • Received: 2020-08-04
  • Published online: 2020-09-24
  • Issue published: 2020-09-24
