

Call for Papers | Multimodal Learning, Temporal Modeling, and Foundation Models for Video Understanding

The MIR special issue "Multimodal Learning, Temporal Modeling, and Foundation Models for Video Understanding" is now open for original submissions. The submission deadline is March 31, 2025. Contributions are welcome!


Special Issue on Multimodal Learning, Temporal Modeling, and Foundation Models for Video Understanding: Introduction

Video understanding focuses on interpreting dynamic visual information from video data to recognize objects, actions, interactions, and environments in a time-structured manner. It has emerged as a critical area of research in computer vision due to its wide-ranging applications in autonomous systems, video surveillance, entertainment, healthcare, and human-computer interaction. Recent advances in deep learning, especially in areas like spatiotemporal processing, multimodal learning, and graph-based modeling, have significantly enhanced models' ability to comprehend complex video scenes. Despite this progress, the following key challenges continue to hinder the development of accurate, efficient, and robust systems:

1. High Dimensionality and Computational Complexity: Videos are inherently high-dimensional data, with multiple frames contributing to a vast amount of information. Analyzing these sequences requires models to be capable of efficiently processing both spatial and temporal information, which often leads to high computational costs. Balancing the accuracy of video understanding models with the need for real-time processing in applications such as autonomous driving or video surveillance is a pressing challenge.

2. Temporal Coherence and Long-term Dependencies: Understanding events in a video often relies on tracking objects and interpreting their actions over time. Capturing temporal coherence, especially over long sequences, is difficult due to the need to model both short-term interactions and long-term dependencies between different entities. Traditional methods struggle to maintain consistent object tracking and event detection across extended time frames.

3. Multimodal Integration: Video data encompasses more than just visual information: auditory cues, text descriptions, and motion data are also essential for comprehending scenes. The challenge lies in effectively fusing these modalities to provide a holistic understanding of the scene. Many systems still struggle to align and integrate multimodal inputs in a meaningful way that improves recognition and interpretation accuracy.

4. Ambiguity in Action and Event Recognition: Distinguishing between similar actions or events in a video can be highly ambiguous. For example, the actions of sitting down and falling can appear visually similar, yet have vastly different meanings. Accurately recognizing and categorizing these nuanced actions requires models with a deep understanding of spatiotemporal context, which is challenging to achieve, especially in complex environments with multiple actors and activities.

5. Occlusion and Viewpoint Variations: In real-world scenarios, objects or people in a video are often occluded or appear from different angles. These occlusions and viewpoint changes can obscure key parts of the scene, leading to ambiguity in identifying actions and objects. Models need to be robust enough to handle partial visibility, changes in camera angles, and dynamic environments, but current systems frequently fall short in such situations.

6. Data Annotation and Scalability: Training effective video understanding models often requires large, annotated datasets. However, manually labeling video data is time-consuming and expensive, particularly when considering both spatial and temporal dimensions. The scalability of current solutions is limited by the availability of large annotated datasets, and the development of models capable of learning from less data or through self-supervision is still in its infancy.

7. Adaptation and Generality: Many state-of-the-art models are trained on curated datasets that may not fully represent the complexity of real-world environments. When deployed in the real world, these models often encounter variations in lighting, weather, and unpredictable interactions, leading to performance degradation. Ensuring that models can generalize to unseen environments and adapt to changing conditions is an ongoing challenge.


We believe that this special issue will offer a timely collection of research outcomes that will benefit video understanding in the long run. Topics of interest include but are not limited to:

• Temporal Dynamics and Spatiotemporal Feature Extraction: Leveraging advanced techniques to model temporal dependencies and relationships between objects and events over time, e.g., graph neural networks and transformers.

• Multimodal Learning for Video Understanding: Integrating visual, auditory, text, or motion information to improve scene comprehension.

• Scene Segmentation in Videos: Enhancements in accurately segmenting dynamic scenes across frames, e.g., video semantic segmentation, video instance segmentation, video panoptic segmentation, video object segmentation, motion segmentation, scene change detection, interactive video segmentation, and video salient object detection.

• Object Tracking in Videos: Advancements in accurately tracking objects across video frames, e.g., single/multiple object tracking, long-term object tracking, trajectory prediction, video object/person re-identification, multimodal object tracking, 3D object tracking, and joint tracking and segmentation.

• Action Recognition and Event Detection: New methods for identifying and distinguishing complex actions and events in videos, e.g., action segmentation, video summarization/captioning, action label prediction, video prediction, video retrieval, procedure and action understanding, and video grounding.

• Data/Label Efficient Video Learning: Developing new techniques for self-supervised learning, unsupervised learning, few-shot learning, and semi-supervised learning with videos. 

• Personalization of Large Foundation Models for Video Understanding: Advanced techniques for personalizing large foundation models (LFMs) for video understanding, e.g., using LFMs for video segmentation and tracking.


1) Submission deadline: March 31, 2025

2) Submission site (now open)


“Step 6 Details & Comments: Special Issue and Special Section---Special Issue on Multimodal Learning, Temporal Modeling, and Foundation Models for Video Understanding”.

3) Submission and peer review guidelines:

Full-length manuscripts and peer review will follow the MIR guidelines. For details:


Yun Liu

Agency for Science, Technology and Research (A*STAR), Singapore


Guolei Sun

ETH Zurich, Switzerland


Radu Timofte

University of Wurzburg, Germany & ETH Zurich, Switzerland


Ender Konukoglu

ETH Zurich, Switzerland


Luc Van Gool

ETH Zurich, Switzerland & KU Leuven, Belgium & Institute for Computer Science, Artificial Intelligence and Technology (INSAIT), Bulgaria



Machine Intelligence Research




About Machine Intelligence Research

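Machine Intelligence Research (MIR, formerly International Journal of Automation and Computing) is sponsored by the Institute of Automation, Chinese Academy of Sciences, and officially began publication in 2022. Rooted in China and oriented toward the world, MIR aims to serve national strategic needs and publishes the latest original research papers, reviews, and commentaries in the field of machine intelligence. It provides comprehensive coverage of fundamental theory and frontier research in machine intelligence worldwide, promoting international academic exchange and disciplinary development and supporting national progress in artificial intelligence science and technology. The journal has been selected for the "China Science and Technology Journal Excellence Action Plan", is indexed by more than 20 international databases including ESCI, EI, Scopus, the China Science and Technology Core Journals list, and CSCD, and is listed as a T2-level well-known journal in the graded journal catalog for the image and graphics field. In 2022, its first CiteScore ranked in Q1 in all eight subcategories across the three major areas of computer science, engineering, and mathematics, with a best ranking in the top 4%; its 2023 CiteScore remained in Q1. In 2024, the journal received its first Impact Factor (IF) of 6.4, placing it in JCR Q1 in both the Artificial Intelligence and the Automation & Control Systems categories.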


