Special Issue

Topic: Vision-and-Language Intelligence: From Image Understanding to Multimodal Reasoning

A Special Issue of Intelligence & Robotics

ISSN 2770-3541 (Online)

Submission deadline: 25 Nov 2025

Guest Editors

Assoc. Prof. Qi Wu
School of Computer and Mathematical Sciences, The University of Adelaide, Adelaide, Australia.
Dr. Feras Dayoub
Responsible AI Research Centre, Australian Institute for Machine Learning, The University of Adelaide, Adelaide, Australia.
Dr. Jason Xue
Responsible AI Research Centre, Australian Institute for Machine Learning, The University of Adelaide, Adelaide, Australia.

Guest Editor Assistant

Dr. Arpit Garg
Responsible AI Research Centre, Australian Institute for Machine Learning, The University of Adelaide, Adelaide, Australia.

Special Issue Introduction

In recent years, the integration of computer vision and natural language processing has driven significant advances in Vision-and-Language intelligence. By bridging visual perception and linguistic reasoning, this interdisciplinary field has enabled machines to understand, reason about, and generate information across modalities. The rapid development of large-scale foundation models and multimodal transformers has fueled breakthroughs in tasks such as image captioning, visual question answering (VQA), visual dialogue, and cross-modal retrieval. These milestones reflect the field's progression toward more general-purpose and human-aligned artificial intelligence.

This Special Issue seeks to bring together the latest research that advances our understanding of multimodal intelligence, particularly focusing on how machines comprehend and interact with visual content through natural language. We aim to highlight both theoretical and practical contributions that push the boundaries of vision-language integration and establish new directions for future research. Potential topics of this Special Issue include but are not limited to the following:

● Image and video captioning, grounding, and dense description;

● VQA, reasoning, and visual dialogue systems;

● Multimodal pretraining, foundation models, and large-scale transformer architectures;

● Cross-modal retrieval, generation, and representation learning;

● Scene understanding through contextual and semantic integration;

● Multimodal data fusion and common-sense reasoning;

● Continual, zero-shot, and few-shot learning in multimodal settings;

● Lightweight, efficient, and deployable vision-language models;

● Applications in education, healthcare, robotics, assistive technologies, and related domains.

Keywords

Multimodal learning, visual question answering (VQA), image and video captioning, vision-language pretraining, multimodal transformers, cross-modal retrieval, visual dialogue, scene understanding, multimodal reasoning

Submission Deadline

25 Nov 2025

Submission Information

For Author Instructions, please refer to https://www.oaepublish.com/ir/author_instructions
For Online Submission, please log in at https://www.oaecenter.com/login?JournalId=ir&IssueId=ir25062510131
Submission Deadline: 25 Nov 2025
Contact: Amber Ren, Managing Editor, [email protected]

Published Articles

Coming soon

Portico

All published articles are preserved here permanently:

https://www.portico.org/publishers/oae/
