Bi-Attention Enhanced Representation Learning for Image-Text Matching (2023)

Table of Contents
Abstract
Introduction
Section snippets (Overview · Notations and problem formulation · Experiments · Conclusion)
Declaration of Competing Interests
Acknowledgments
References (41)

Pattern Recognition, Volume 140, August 2023, Article 109548

Authors: Yumin Tian, Aqiang Ding, Di Wang, Xuemei Luo, Bo Wan, Yifeng Wang

Abstract

Image-text matching has become a research hotspot in recent years. The key to image-text matching is to accurately measure the similarity between an image and a sentence. However, most existing methods focus either on the inter-modality similarities between regions in images and words in text or on the intra-modality similarities among image regions or among words, so they cannot take full advantage of the detailed correlations between images and texts. Furthermore, existing methods typically train their models with a triplet ranking loss computed on randomly selected triplets. Because the weights of positive and negative samples are not adjusted, this loss cannot provide enough gradient information for training, resulting in slow convergence and limited performance. To address the above issues, we propose an image-text matching method called Bi-Attention Enhanced Representation Learning (BAERL). It builds a self-attention learning sub-network to exploit intra-modality correlations within image regions or words, and a co-attention learning sub-network to exploit inter-modality correlations between image regions and words. The representations obtained from the two sub-networks together capture the holistic correlations between images and texts. Additionally, BAERL is trained with a self-similarity polynomial loss instead of the triplet ranking loss; this loss adaptively assigns appropriate weights to different pairs based on their similarity scores to further improve retrieval performance. Experiments on two benchmark datasets demonstrate the superior performance of the proposed BAERL method over several state-of-the-art methods.
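To fix ideas before the details, the following minimal sketch (our own illustration under assumptions, not the authors' implementation) shows one plausible way that embeddings from a self-attention sub-network and a co-attention sub-network could be blended with a balance weight and scored by cosine similarity; the function name, blending weight, and dimensions are all assumptions.

```python
import torch
import torch.nn.functional as F

def baerl_style_score(self_att_img, co_att_img, self_att_txt, co_att_txt, lam=0.5):
    """Blend self-attention and co-attention embeddings with weight lam (assumed),
    L2-normalize, and return a batch similarity matrix via cosine similarity."""
    img = F.normalize(lam * self_att_img + (1.0 - lam) * co_att_img, dim=-1)
    txt = F.normalize(lam * self_att_txt + (1.0 - lam) * co_att_txt, dim=-1)
    return img @ txt.t()          # entry (i, j): similarity of image i and sentence j

sim = baerl_style_score(torch.randn(8, 512), torch.randn(8, 512),
                        torch.randn(8, 512), torch.randn(8, 512))
print(sim.shape)                  # torch.Size([8, 8])
```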

Introduction

With the rapid development of the Internet and multimedia technology, mining the relationships between multimedia data (image, text, audio, video, etc.) is becoming increasingly important for many applications, e.g., [1]. Image-text matching [2], [3] is one of the most important tasks in multimedia data relationship mining and is widely used in many applications. For example, it can be used in search engines to find desired multimedia resources, in e-commerce websites to match product information, and in the development of assistive tools for visually impaired people [4]. With the continuous application of deep learning technologies in multimedia [5], [6], existing image-text retrieval methods can be divided into three categories: subspace learning-based methods, CNN-RNN-based methods, and Transformer-based methods.

Methods based on subspace learning [7], [8] aim to find a common subspace in which the similarities between data of different modalities can be calculated directly. These methods learn linear distance metrics and cannot truly capture the high-level semantics of nonlinear multimodal data, leading to unsatisfactory performance in practical applications. CNN-RNN-based methods [9], [10], [11] usually first extract global image features with convolutional neural networks (CNNs) and text features with recurrent neural networks (RNNs), and then use several linear layers to map the image and text features into a common embedding space in which image-text matching is performed. Such methods neglect the fine-grained similarity relationships between image regions and words, which limits their matching accuracy. To address this problem, many CNN-RNN-based methods are designed to exploit finer-grained similarity relationships between images and texts.
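As a rough illustration of this family of pipelines (a generic sketch under assumed dimensions, not any specific published model), a CNN produces one global image vector, an RNN produces one sentence vector, and linear layers project both into a shared embedding space:

```python
import torch
import torch.nn as nn

class GlobalEmbeddingMatcher(nn.Module):
    """Generic CNN-RNN matching sketch: a pre-extracted global CNN feature and a GRU
    sentence feature are projected into a shared space (dimensions are assumptions)."""
    def __init__(self, img_dim=2048, vocab=10000, word_dim=300, hid=1024, emb=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb)       # maps the global image feature
        self.embed = nn.Embedding(vocab, word_dim)
        self.gru = nn.GRU(word_dim, hid, batch_first=True)
        self.txt_proj = nn.Linear(hid, emb)

    def forward(self, img_feat, token_ids):
        img = nn.functional.normalize(self.img_proj(img_feat), dim=-1)
        _, h = self.gru(self.embed(token_ids))        # h: (1, B, hid), last hidden state
        txt = nn.functional.normalize(self.txt_proj(h[-1]), dim=-1)
        return img @ txt.t()                          # batch similarity matrix

model = GlobalEmbeddingMatcher()
sim = model(torch.randn(4, 2048), torch.randint(0, 10000, (4, 12)))
print(sim.shape)   # torch.Size([4, 4])
```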

Recently, the Transformer has made remarkable advances in retrieval and detection applications, and many recent research efforts have been devoted to using the Transformer instead of the common CNN-RNN pipeline in image-text matching tasks [12], [13]. Self-Attention Embeddings (SAEM) [12] uses the Transformer's self-attention mechanism to model fragment relationships within images or texts. However, it only explores the intra-modality similarities among image regions or among words and ignores the inter-modality similarity relationships between image regions and words. Building on SAEM, the Multi-Modality Cross Attention (MMCA) network [13] designs a novel cross-attention mechanism to jointly model the intra-modality and inter-modality relationships between image regions and words, but its model tends to retain more intra-modality correlations than inter-modality correlations. In summary, existing Transformer-based image-text matching methods do not balance well the inter-modality and intra-modality similarities between image regions and words, both of which are critical to image-text matching.
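To make the distinction concrete, the following minimal PyTorch sketch (our illustration, not the code of any of the cited methods) contrasts the two attention patterns: intra-modality self-attention draws queries, keys, and values from one modality, while inter-modality cross-attention lets one modality query the other. All tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """softmax(QK^T / sqrt(d)) V — the basic Transformer attention operation."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    return F.softmax(scores, dim=-1) @ v

regions = torch.randn(36, 512)   # 36 image-region features (illustrative dimension)
words   = torch.randn(12, 512)   # 12 word features of the paired sentence

# Intra-modality (self-attention): regions attend to regions, words to words.
intra_regions = scaled_dot_product_attention(regions, regions, regions)
intra_words   = scaled_dot_product_attention(words, words, words)

# Inter-modality (co-/cross-attention): regions query the words and vice versa.
inter_regions = scaled_dot_product_attention(regions, words, words)    # (36, 512)
inter_words   = scaled_dot_product_attention(words, regions, regions)  # (12, 512)
```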

In addition, existing methods typically use triplet ranking losses to train their models. However, for the traditional triplet loss, the selection of triplets is very important. If the selected triplets already satisfy the condition that the distances between samples of different categories are much larger than those between samples of the same category, they contribute nothing to the training process, resulting in slower convergence [14].
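As a concrete illustration of this point, here is a minimal sketch (an assumed, standard formulation, not the paper's exact loss) of a hinge-based triplet ranking loss with cosine similarity: triplets that already satisfy the margin produce a zero hinge term and therefore no gradient.

```python
import torch
import torch.nn.functional as F

def triplet_ranking_loss(anchor_img, pos_txt, neg_txt, margin=0.2):
    """max(0, margin - s(a, p) + s(a, n)) with cosine similarity s."""
    s_pos = F.cosine_similarity(anchor_img, pos_txt, dim=-1)
    s_neg = F.cosine_similarity(anchor_img, neg_txt, dim=-1)
    return F.relu(margin - s_pos + s_neg).mean()

img = torch.randn(8, 512, requires_grad=True)
easy_neg = -img.detach()                          # very dissimilar "easy" negatives
loss = triplet_ranking_loss(img, img.detach() + 0.01, easy_neg)
loss.backward()
print(loss.item(), img.grad.abs().sum().item())   # both ~0: no learning signal
```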

To address the above challenges, a novel Transformer-based image-text matching method called the Bi-Attention Enhanced Representation Learning (BAERL) network is proposed in this paper. The proposed network designs a self-attention learning sub-network to explore the intra-modality semantic relationships within image regions or words, and a co-attention learning sub-network to exploit the inter-modality correlations between image regions and words. By manually adjusting appropriate weights to balance the intra-modality and inter-modality correlations, the BAERL network achieves promising performance. In addition, the self-similarity polynomial loss is used to weight different pairs more appropriately during training, further improving retrieval performance. The contributions of this work can be summarized as follows.

A novel image-text representation learning network that fully exploits the Transformer's multi-head self-attention mechanism is proposed to capture the inter-modality and intra-modality correlations between regions in images and words in text for the image-text matching task.


A self-similarity polynomial loss, which adaptively assigns appropriate weights to different pairs according to their similarity scores, is used to train the network to further improve retrieval performance (an illustrative sketch of this weighting idea follows the list of contributions).

Extensive experimental results on two benchmark datasets demonstrate the superiority of the proposed BAERL over several state-of-the-art methods for the image-text matching task.
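As a rough illustration of the adaptive weighting idea referenced in the second contribution above, the sketch below weights each pair by a polynomial of its own similarity score, so hard positives (low similarity) and hard negatives (high similarity) dominate the loss. The coefficients, margins, and overall form are our assumptions, not the paper's exact self-similarity polynomial loss.

```python
import torch

def polynomial_weighted_loss(sim, margin_pos=0.8, margin_neg=0.4,
                             a=(1.0, 2.0), b=(1.0, 2.0)):
    """sim: (B, B) image-text similarity matrix whose diagonal holds matched pairs.
    Each pair's contribution is scaled by a polynomial of its similarity score."""
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    pos = sim[eye]        # matched (positive) pairs
    neg = sim[~eye]       # mismatched (negative) pairs

    # Harder positives (similarity below margin_pos) receive larger weights.
    gap_p = torch.clamp(margin_pos - pos, min=0.0)
    loss_p = (a[0] * gap_p + a[1] * gap_p ** 2).mean()

    # Harder negatives (similarity above margin_neg) receive larger weights.
    gap_n = torch.clamp(neg - margin_neg, min=0.0)
    loss_n = (b[0] * gap_n + b[1] * gap_n ** 2).mean()
    return loss_p + loss_n
```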

The remainder of this paper is organized as follows. Section 3 presents the proposed BAERL network, Section 4 presents the experimental results and analysis, and conclusions are drawn in Section 5.

Section snippets

Overview

Image-text matching aims to solve the problem of similarity matching between related images and texts. The core problem of image-text matching is to accurately measure the similarity between an image and a sentence, and the modality gap between images and texts poses a major challenge to measuring these similarities. Existing cross-modal retrieval methods are mainly divided into three categories: subspace learning-based methods, CNN-RNN-based methods, and Transformer-based methods.

Subspace learning

Notations and problem formulation

Assume there are $N$ images $X=\{x_1, x_2, \ldots, x_N\}$ and $N$ sentences $Y=\{y_1, y_2, \ldots, y_N\}$, where $x_n$ is the $n$-th image sample and $y_n$ is the sentence paired with the $n$-th image sample. For image samples, we use a pre-trained Faster R-CNN [19] network in conjunction with ResNet-101 [16] to extract $K_1$ image regions. The Faster R-CNN with ResNet-101 is pre-trained for classification on ImageNet [24] and then on the Visual Genome [25] dataset, with ResNet-101 serving as the feature extractor of Faster R-CNN. For…
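The shapes implied by this notation can be sketched as follows; all sizes are illustrative assumptions, and the text side is shown with BERT-style token features only because reference [20] suggests BERT is used for sentences (the snippet itself is cut off before that point).

```python
import torch

N, K1, K2 = 4, 36, 20                   # pairs per batch, regions per image, words per sentence (assumed)
D_img, D_txt, D_emb = 2048, 768, 512    # region, word, and shared embedding sizes (assumed)

region_feats = torch.randn(N, K1, D_img)  # X = {x_1, ..., x_N}: K1 Faster R-CNN region features per image
word_feats   = torch.randn(N, K2, D_txt)  # Y = {y_1, ..., y_N}: token features of the paired sentences

# Both modalities are typically projected into a shared space before the attention sub-networks.
proj_img = torch.nn.Linear(D_img, D_emb)
proj_txt = torch.nn.Linear(D_txt, D_emb)
img_tokens = proj_img(region_feats)       # (N, K1, D_emb)
txt_tokens = proj_txt(word_feats)         # (N, K2, D_emb)
```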

Experiments

In this section, we perform experiments on two cross-modal benchmark datasets, MS-COCO [31] and Flickr30K [32], to verify the effectiveness of the proposed BAERL network. The performance of the proposed method and the comparison methods is evaluated with the Recall@K (R@K) score metric commonly used in cross-modal retrieval. Furthermore, the effectiveness of each sub-network in the proposed framework is verified by ablation experiments and parameter sensitivity experiments.
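For reference, R@K can be computed from a similarity matrix as in the sketch below. This is a simplified setting that assumes one ground-truth sentence per image; MS-COCO and Flickr30K actually provide five captions per image, which the full evaluation protocol accounts for.

```python
import torch

def recall_at_k(sim, ks=(1, 5, 10)):
    """Image-to-text Recall@K from an (N_images x N_texts) similarity matrix,
    assuming text i is the ground-truth match of image i."""
    ranking = sim.argsort(dim=1, descending=True)            # texts ranked for each image
    gt = torch.arange(sim.size(0), device=sim.device).unsqueeze(1)
    rank_of_gt = (ranking == gt).float().argmax(dim=1)       # position of the true match
    return {k: (rank_of_gt < k).float().mean().item() for k in ks}

sim = torch.randn(100, 100)
print(recall_at_k(sim))          # image-to-text scores
print(recall_at_k(sim.t()))      # text-to-image scores use the transposed matrix
```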

Conclusion

In this paper, we propose an image-text matching method called Bi-Attention Enhanced Representation Learning (BAERL). It learns image and text representations that capture both the intra-modality and inter-modality similarities of image regions and words. A self-attention learning sub-network and a co-attention learning sub-network are designed to learn the self-attention representations and co-attention representations of images and sentences, respectively. Moreover, the self-similarity polynomial…

Declaration of Competing Interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the National Key R&D Program of China under Grant No. 2022ZD0117103, in part by the National Science Foundation of China under Grants 62072354, 61972302, 62276203, and 62072355, in part by the Shaanxi Province Key Research and Development Program under Grants 2022GY-057, 2021GY-086, and 2021GY-014, in part by the Shaanxi Province Key Industry Innovation Chain Projects under Grants 2021ZDLGY07-04 and 2019ZDLGY13-01, and in part by the Foundation…


References (41)

  • L. Zhang et al.

    Multi-task framework based on feature separation and reconstruction for cross-modal retrieval

    Pattern Recognit.

    (2022)

  • F. Sohrab et al.

    Multimodal subspace support vector data description

    Pattern Recognit.

    (2021)

  • Q. Zhao et al.

    A feature consistency driven attention erasing network for fine-grained image retrieval

    Pattern Recognit.

    (2022)

  • S. Alashhab et al.

    Efficient gesture recognition for the assistance of visually impaired people using multi-head neural networks

    Eng. Appl. Artif. Intell.

    (2022)

  • Y. Duan et al.

    MS2GAH: Multi-label semantic supervised graph attention hashing for robust cross-modal retrieval

    Pattern Recognit.

    (2022)

  • D. Wang et al.

    Joint and individual matrix factorization hashing for large-scale cross-modal retrieval

    Pattern Recognit.

    (2020)


  • X. Xu et al.

    Cross-modal attention with semantic consistency for image-text matching

    IEEE Trans. Neural Netw. Learn. Syst.

    (2020)

  • V.M. Vargas et al.

    Unimodal regularisation based on beta distribution for deep ordinal regression

    Pattern Recognit.

    (2022)

  • F. Yan et al.

    Deep correlation for matching images and text

    Proceedings of the 28th IEEE Conference on Computer Vision and Pattern Recognition

    (2015)

  • Y. Zhang et al.

    Deep cross-modal projection learning for image-text matching

    Proceedings of the 15th European Conference on Computer Vision

    (2018)

  • F. Huang et al.

    Bi-directional spatial-semantic attention networks for image-text matching

    IEEE Trans. Image Process.

    (2019)

  • Y. Wu et al.

    Learning fragment self-attention embeddings for image-text matching

    Proceedings of the 27th ACM International Conference on Multimedia

    (2019)

  • X. Wei et al.

    Multi-modality cross attention network for image and sentence matching

    Proceedings of the 33rd IEEE/CVF Conference on Computer Vision and Pattern Recognition

    (2020)

  • F. Schroff et al.

    FaceNet: a unified embedding for face recognition and clustering

    Proceedings of the Conference on Computer Vision and Pattern Recognition

    (2015)

  • N. Rasiwasia et al.

    A new approach to cross-modal multimedia retrieval

    Proceedings of the 18th ACM International Conference on Multimedia

    (2010)

  • K. He et al.

    Deep residual learning for image recognition

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2016)

  • Y. Huang et al.

    Instance-aware image and sentence matching with selective multimodal LSTM

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2017)

  • L. Ma et al.

    Multimodal convolutional neural networks for matching image and sentence

    2015 IEEE International Conference on Computer Vision, Santiago, Chile

    (2015)

  • S. Ren et al.

    Faster R-CNN: Towards real-time object detection with region proposal networks

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2017)

  • J. Devlin et al.

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Proceedings of the 18th Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

    (2019)


    Featured Articles (6)

    • Research article

      BabyNet: Reconstructing 3D babies' faces from uncalibrated photos

      Pattern Recognition, Volume 139, 2023, Article 109367

      We present BabyNet, a 3D facial reconstruction system aimed at recovering babies' 3D facial geometry from uncalibrated photographs. Because the 3D facial geometry of babies differs significantly from that of adults, baby-specific facial reconstruction systems are needed. BabyNet consists of two stages: 1) a 3D graph convolutional autoencoder learns a latent space of babies' 3D face shapes; and 2) a 2D encoder maps photographs to this 3D latent space using representative features extracted with transfer learning. This allows us to reconstruct a 3D face from 2D images using the pre-trained 3D decoder. We evaluate BabyNet and show that 1) methods based on adult datasets fail to model babies' 3D facial geometry, demonstrating the need for a baby-specific method, and 2) BabyNet outperforms classical model-fitting methods even when a baby-specific 3D morphable model, such as BabyFM, is used.

    • Research article

      TETFN: A Text-Enhanced Transformer Fusion Network for Multimodal Sentiment Analysis

      Pattern Recognition, Volume 136, 2023, Article 109259

      Multimodal sentiment analysis (MSA), which aims to detect the sentiment expressed by speakers in videos using textual, visual, and auditory cues, has attracted significant research attention in recent years. However, the textual, visual, and auditory modalities often contribute differently to sentiment analysis. In general, text contains more intuitive sentiment information and outperforms the non-linguistic modalities in MSA. Finding a strategy that exploits this property to obtain a fused representation containing more sentiment-related information, while preserving inter- and intra-modality relationships, therefore becomes a major challenge. To this end, we propose a novel method called Text Enhanced Transformer Fusion Network (TETFN), which learns text-oriented pairwise cross-modal mappings to obtain effective unified multimodal representations. In particular, textual information is incorporated into learning sentiment-related nonverbal representations through text-based multi-head attention. Besides maintaining consistency information through the cross-modal mappings, the differentiated information between modalities is also preserved through unimodal label prediction. In addition, the pre-trained Vision Transformer is used to extract visual features from the original videos to preserve both global and local information of a human face. Extensive experiments on the benchmark datasets CMU-MOSI and CMU-MOSEI demonstrate the superior performance of the proposed TETFN compared with state-of-the-art methods.

    • Research article

      Multilevel similarity learning for image-text retrieval

      Information Processing & Management, Volume 58, Issue 1, 2021, Article 102432

      The image-text retrieval task has been a popular research topic and is attracting growing interest, as it bridges the computer vision and natural language processing communities and involves two distinct modalities. Although many methods have made great progress on the image-text task, it remains challenging because it is difficult to learn the correspondence between two heterogeneous modalities. In this article, we propose a multi-level representation learning approach for the image-text retrieval task that uses semantic, structural, and contextual information to improve the quality of the visual and textual representations. To leverage the semantic-level information, we first extract high-frequency nouns, adjectives, and numbers as semantic labels and adopt a multi-label convolutional neural network framework to encode the semantic-level information. To exploit the structural-level information of image-text pairs, we first construct two graphs to encode the visual and textual information of the corresponding modality, and then apply graph matching with a triplet loss to reduce the discrepancy between the modalities. To further improve the retrieval results, we use the context-level information from the two modalities to refine the ranking and improve the retrieval quality. Extensive experiments on Flickr30k and MSCOCO, two commonly used datasets for image-text retrieval, demonstrate the superiority of our proposed method.

    • Research article

      Iterative graph attention memory network for cross-modal retrieval

      Knowledge-Based Systems, Volume 226, 2021, Article 107138

      How to eliminate the semantic gap between multimodal data and effectively fuse multimodal data is the key problem of cross-modal retrieval. The abstractness of semantics makes a single semantic representation one-sided. To obtain complementary semantic information for samples with the same semantics, we construct a local graph for each instance and use a Graph Feature Extractor (GFE) to reconstruct the sample representation based on the adjacency relationships between the sample itself and its neighbors. Because some cross-modal methods focus only on paired-sample learning and cannot further integrate cross-modal information from the other modality, we propose a cross-modal graph attention strategy to generate, for each sample, a graph attention representation over the local graph of its corresponding paired sample. To eliminate the heterogeneous gap between modalities, we fuse the features of the two modalities using a recurrent gated memory network that selects prominent features from the other modality and filters out unimportant information, obtaining a more discriminative feature representation in the shared latent space. Experiments on four benchmark datasets demonstrate the superiority of our proposed model compared with state-of-the-art cross-modal methods.

    • Research article

      Multi-view inter-modality representation with progressive fusion for image-text matching

      Neurocomputing, Volume 535, 2023, pp. 1-12

      Recently, image-text matching has been intensively researched to bridge vision and language. Previous methods examine an intermodality relationship between an image-text pair from the single-view feature. However, it is difficult to discover all the abundant information based on a single intermodal relationship. In this paper, a novel Multi-View Inter-Modality Representation with Progressive Fusion (MIRPF) is developed to study intermodality relationships from multi-view features. The multi-view strategy offers more complementary and global semantic cues than single-view approaches. In particular, the multi-view intermodality representation network is built to generate multiple intermodality representations that provide different views to discover the latent image-text relationships. In addition, the progressive fusion module is performed to progressively fuse intermodal features, fully exploiting the inherent complementarity between different views. Extensive experiments on Flickr30K and MSCOCO demonstrate the superiority of MIRPF over several existing approaches. The code is available at: https://github.com/jasscia18/MIRPF.

    • Research article

      Complementarity is King: Multimodal and Multigranular Hierarchical Semantic Enhancement Network for Cross-Modal Retrieval

      Expert Systems with Applications, Volume 216, 2023, Article 119415

      Cross-modal retrieval takes a query from one modality to retrieve relevant results from another modality, and its main problem is how to learn cross-modal similarity. Note that the complete semantic information of a given concept is widely scattered across multimodal and multigranular data and cannot be fully captured by most existing methods, which therefore cannot accurately learn cross-modal similarity. Therefore, we propose a multimodal and multigranular hierarchical semantic enhancement network (M2HSE), which contains two stages to obtain more complete semantic information by fusing the complementarity in multimodal and multigranular data. In stage 1, two classes of cross-modal similarity (primary similarity and auxiliary similarity) are computed more comprehensively in two sub-networks. In particular, the primary similarities of the two sub-networks are fused to perform cross-modal retrieval, while the auxiliary similarity provides a valuable complement to the primary similarity. In stage 2, the multi-spring balance loss is proposed to optimize cross-modal similarity more flexibly. Exploiting this loss, the most representative samples are selected to establish the multi-spring balance system, which adaptively optimizes the cross-modal similarities until an equilibrium state is reached. Extensive experiments on public benchmark datasets clearly demonstrate the effectiveness of our proposed method and show its competitiveness with state-of-the-art methods.


    Yumin Tian received BSc and MSc degrees in Computer Applications from Xidian University, China, in 1984 and 1987, respectively. She is currently a professor in the School of Computer Science and Technology at Xidian University. Her research interests include image processing, 3D shape restoration, digital watermarking, and computer vision.


    Aqiang Ding received the BS degree in Computer Science and Technology from Xidian University, Xi'an, China, in 2019. He is currently pursuing the MS degree in Computer Science and Technology at Xidian University. His research interests focus on machine learning and multimedia information retrieval.


    Di Wang received her PhD in Intelligent Information Processing from Xidian University, Xi'an, China, in 2016. She is currently an Associate Professor in the School of Computer Science and Technology at Xidian University. Her research interests include machine learning and multimedia information retrieval.


    Xuemei Luo received her PhD in Computer System Architecture from Xidian University, Xi'an, China, in 2012. She is currently a lecturer in the School of Computer Science and Technology at Xidian University. Her research interests include color management, graphics and image processing, and machine learning.


    Bo Wan received his BS, MS, and PhD degrees from Xidian University, Xi'an, Shaanxi, China. He is currently a professor in the School of Computer Science and Technology at Xidian University. His current research interests include input/output technologies and systems, human-computer interaction, and cloud computing.


    Yifeng Wang received his PhD in Computer Science and Technology from Xidian University, Xi'an, China, in 2009. He is currently an Associate Professor in the School of Computer Science and Technology at Xidian University. His research interests focus on machine learning and computer vision.


    © 2023 Elsevier Ltd. All rights reserved.
