Multimodal Convolutional Neural Networks for Matching Image and Sentence

Lin Ma, Zhengdong Lu, Lifeng Shang, and Hang Li

Huawei Noah's Ark Lab, Hong Kong

[PDF] [Video Spotlight] [Poster Presentation]


In this paper, we propose multimodal convolutional neural networks (m-CNNs) for matching image and sentence. Our m-CNN provides an end-to-end framework with convolutional architectures to exploit image representation, word composition, and the matching relations between the two modalities. More specifically, it consists of one image CNN encoding the image content and one matching CNN modeling the joint representation of image and sentence. The matching CNN composes different semantic fragments from words and learns the inter-modal relations between the image and the composed fragments at different levels, thereby fully exploiting the matching relations between image and sentence. Experimental results demonstrate that the proposed m-CNNs can effectively capture the information necessary for image and sentence matching. In particular, our proposed m-CNNs significantly outperform the state-of-the-art approaches for bidirectional image and sentence retrieval on the Flickr8K, Flickr30K, and Microsoft COCO datasets.

The contributions of this work:

Multimodal Convolutional Neural Network (m-CNN)

Multimodal matching between image and sentence is complicated, and usually occurs at different levels, specifically the local word and phrase levels as well as the global sentence level.

The multimodal convolutional neural network (m-CNN), which consists of three components, takes the image and sentence as input and generates the matching score between them.
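To make the idea of a matching score concrete, the following is a minimal sketch of word-level matching: each window of consecutive word vectors is concatenated with an image vector, passed through a set of filters, max-pooled over windows, and mapped to a scalar. It is only an illustration under stated assumptions, not the m-CNN described in the paper: the function names, dimensions, window size, ReLU nonlinearity, max-pooling, and random weights are all hypothetical, and the image vector here is random rather than produced by an image CNN.

```python
import random

def relu(x):
    return x if x > 0.0 else 0.0

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def word_level_match(image_vec, word_vecs, filters, out_weights, window=3):
    """Toy matching score (hypothetical): joint convolution over word windows and an image vector."""
    if len(word_vecs) < window:
        raise ValueError("sentence shorter than the convolution window")
    # One joint input per window: the concatenated word vectors followed by the image vector.
    windows = []
    for start in range(len(word_vecs) - window + 1):
        joint = []
        for vec in word_vecs[start:start + window]:
            joint.extend(vec)
        joint.extend(image_vec)
        windows.append(joint)
    # Each filter yields one feature per window; max-pool those features over all windows.
    pooled = [max(relu(dot(f, w)) for w in windows) for f in filters]
    # A final linear layer maps the pooled features to a scalar score.
    return dot(out_weights, pooled)

def random_vec(n, rng):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

# Assumed toy sizes; real systems use far larger embeddings and learned weights.
rng = random.Random(0)
word_dim, image_dim, window, n_filters = 4, 5, 3, 6
image_vec = random_vec(image_dim, rng)          # stand-in for an image CNN output
sentence = [random_vec(word_dim, rng) for _ in range(7)]
filters = [random_vec(window * word_dim + image_dim, rng) for _ in range(n_filters)]
out_weights = random_vec(n_filters, rng)

score = word_level_match(image_vec, sentence, filters, out_weights, window)
print(score)
```

Feeding the image vector into every window is what lets the filters respond to word fragments and image content jointly, rather than embedding the two modalities separately and comparing them afterwards.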

Experimental Results

Bidirectional image and sentence retrieval results on Flickr8K

Bidirectional image and sentence retrieval results on Flickr30K

Bidirectional image and sentence retrieval results on Microsoft COCO
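As a rough illustration of how bidirectional retrieval can be scored once a matching score is available for every image-sentence pair, the sketch below ranks candidates in both directions and computes Recall@K. The metric, the function names, and the score matrix are assumptions for illustration only: this page does not state which metric the result tables use, the numbers are invented toy values rather than results, and the sketch assumes exactly one ground-truth sentence per image, whereas these datasets provide several captions per image.

```python
def rank_of_match(scores, query, target):
    """1-based rank of `target` when candidates are sorted by descending score."""
    row = scores[query]
    return 1 + sum(1 for s in row if s > row[target])

def recall_at_k(scores, k):
    """Fraction of queries whose ground-truth candidate (same index) ranks in the top k."""
    hits = sum(1 for q in range(len(scores)) if rank_of_match(scores, q, q) <= k)
    return hits / len(scores)

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

# Hypothetical toy scores: rows are images, columns are sentences;
# entry [i][j] is the matching score of image i with sentence j.
scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]

# Image query -> rank sentences; sentence query -> rank images (transposed matrix).
image_to_sentence = recall_at_k(scores, 1)
sentence_to_image = recall_at_k(transpose(scores), 1)
print(image_to_sentence, sentence_to_image)
```

The same score matrix serves both directions: rows are read for image-to-sentence retrieval and columns for sentence-to-image retrieval.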


L. Ma, Z. Lu, L. Shang, and H. Li, "Multimodal Convolutional Neural Networks for Matching Image and Sentence", ICCV, 2015. [Full Text]
A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. A. Ranzato, and T. Mikolov, "DeViSE: A Deep Visual-Semantic Embedding Model", NIPS, 2013. [Full Text]
R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, "Grounded Compositional Semantics for Finding and Describing Images with Sentences", TACL, 2: 207-218, 2014. [Full Text]
R. Kiros, R. Salakhutdinov, and R. S. Zemel, "Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models", arXiv:1411.2539, 2014. [Full Text]
J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille, "Explain Images with Multimodal Recurrent Neural Networks", arXiv:1410.1090, 2014. [Full Text]
A. Karpathy, A. Joulin, and L. Fei-Fei, "Deep Fragment Embeddings for Bidirectional Image Sentence Mapping", NIPS, 2014. [Full Text]
X. Chen and C. L. Zitnick, "Learning a Recurrent Visual Representation for Image Caption Generation", arXiv:1411.5654, 2014. [Full Text]
A. Karpathy and L. Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions", arXiv:1412.2306, 2014. [Full Text]
F. Yan and K. Mikolajczyk, "Deep Correlation for Matching Images and Text", CVPR, 2015. [Full Text]
O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and Tell: A Neural Image Caption Generator", arXiv:1411.4555, 2014. [Full Text]
B. Klein, G. Lev, G. Sadeh, and L. Wolf, "Associating Neural Word Embeddings with Deep Image Representations using Fisher Vectors", CVPR, 2015. [Full Text]
J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term Recurrent Convolutional Networks for Visual Recognition and Description", arXiv:1411.4389, 2014. [Full Text]
B. Plummer, L. Wang, C. Cervantes, J. Caicedo, J. Hockenmaier, and S. Lazebnik, "Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models", arXiv:1505.04870, 2015. [Full Text]
R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun, and S. Fidler, "Skip-Thought Vectors", arXiv:1506.06726, 2015. [Full Text]
P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "OverFeat: Integrated Recognition, Localization and Detection Using Convolutional Networks", arXiv:1312.6229, 2014. [Full Text]
K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition", arXiv:1409.1556, 2014. [Full Text]

Contact Me

If you have any questions, please feel free to contact Dr. Lin Ma.


Last update: Nov. 17, 2015