Special Issue
Topic: Vision-and-Language Intelligence: From Image Understanding to Multimodal Reasoning
Guest Editors
Guest Editor Assistant
Special Issue Introduction
In recent years, the integration of computer vision and natural language processing has driven significant advances in Vision-and-Language intelligence. By bridging the gap between visual perception and linguistic reasoning, this interdisciplinary field has enabled machines to understand, reason about, and generate information across modalities. The rapid development of large-scale foundation models and multimodal transformers has fueled breakthroughs in tasks such as image captioning, visual question answering (VQA), visual dialogue, and cross-modal retrieval. These milestones reflect the field's progression toward more general-purpose and human-aligned artificial intelligence.
This Special Issue seeks to bring together the latest research that advances our understanding of multimodal intelligence, with a particular focus on how machines comprehend and interact with visual content through natural language. We aim to highlight both theoretical and practical contributions that push the boundaries of vision-language integration and establish new directions for future research. Potential topics of this Special Issue include, but are not limited to, the following:
● Image and video captioning, grounding, and dense description;
● VQA, reasoning, and visual dialogue systems;
● Multimodal pretraining, foundation models, and large-scale transformer architectures;
● Cross-modal retrieval, generation, and representation learning;
● Scene understanding through contextual and semantic integration;
● Multimodal data fusion and common-sense reasoning;
● Continual, zero-shot, and few-shot learning in multimodal settings;
● Lightweight, efficient, and deployable vision-language models;
● Applications in education, healthcare, robotics, assistive technologies, etc.
Keywords
Multimodal learning, visual question answering (VQA), image and video captioning, vision-language pretraining, multimodal transformers, cross-modal retrieval, visual dialogue, scene understanding, multimodal reasoning
Submission Information
For Author Instructions, please refer to https://www.oaepublish.com/ir/author_instructions
For Online Submission, please login at https://www.oaecenter.com/login?JournalId=ir&IssueId=ir25062510131
Submission Deadline: 25 Nov 2025
Contacts: Amber Ren, Managing Editor, [email protected]