The First Workshop of Evaluation of Multi-Modal Generation

Multimodal generation techniques have opened new avenues for creative content generation. However, evaluating the quality of multimodal generation remains underexplored, and some key questions are unanswered, such as the contribution of each modality, the utility of pre-trained large language models for multimodal generation, and how to measure faithfulness and fairness in multimodal outputs. This workshop aims to foster discussions and research efforts by bringing together researchers and practitioners in natural language processing, computer vision, and multimodal AI. Our goal is to establish evaluation methods for multimodal research and advance research efforts in this direction.

Schedule

Date: 20 January 2025 (Monday)
Venue: Abu Dhabi National Exhibition Center, Capital Suite 10
All times are Abu Dhabi local time, Gulf Standard Time (GST), UTC+4

Time Presentation Details
9:00 - 9:10 Opening
9:10 - 10:10 Keynote I - A/Prof Qi Wu
Topic: Reasoning is Measurable: Two new evaluation datasets & metrics on LLMs and MLLMs
10:10 - 10:30 Paper presentation
CVT5: Using Compressed Video Encoder and UMT5 for Dense Video Captioning
Authors: Mohammad Javad Pirhadi, Motahhare Mirzaei and Sauleh Eetemadi
10:30 - 11:00 Conference tea break
11:00 - 12:00 Keynote II - Prof Timothy Baldwin
Topic: Evaluating The "Humanism" of Foundation Models: Culture and Safety
12:00 - 12:40 Paper presentation
TaiwanVQA: A Benchmark for Visual Question Answering for Taiwanese Daily Life
Authors: Hsin-Yi Hsieh, Shang Wei Liu, Chang Chih Meng, Shuo-Yueh Lin, Chen Chien-Hua, Hung-Ju Lin, Hen-Hsen Huang and I-Chen Wu

LLaVA-RE: Binary Image-Text Relevancy Evaluation with Multimodal Large Language Model
Authors: Tao Sun, Oliver Liu, JinJin Li and Lan Ma

(Invited) ACE-M^3: Automatic Capability Evaluator for Multimodal Medical Models
Authors: Xiechi Zhang, Shunfan Zheng, Linlin Wang, Gerard de Melo, Zhu Cao, Xiaoling Wang and Liang He
13:00 - 14:00 Conference lunch
14:00 - 15:00 Keynote III - Dr Yova Kementchedjhieva
Topic: Fine-grained Image Caption Generation and Evaluation
15:00 - 15:20 Paper presentation
Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types
Authors: Neelabh Sinha, Vinija Jain and Aman Chadha
15:30 - 16:00 Conference tea break
16:00 - 16:50 Paper presentation
If I feel smart, I will do the right thing: Combining Complementary Multimodal Information in Visual Language Models
Authors: Yuyu Bai and Sandro Pezzelle

A Dataset for Programming-based Instructional Video Classification and Question Answering
Authors: Sana Javaid Raja, Adeel Zafar and Aqsa Shoaib

Persian in a Court: Benchmarking VLMs In Persian Multi-Modal Tasks
Authors: Farhan Farsi, Shahriar Shariati Motlagh, Shayan Bali, Sadra Sabouri and Saeedeh Momtazi

Venue

Venue: Abu Dhabi National Exhibition Center, Capital Suite 10

Call for Papers

Both long papers and short papers (up to 8 pages and 4 pages respectively, with unlimited references and appendices) are welcome.

Topics relevant to this workshop include (but are not limited to):

Important Dates

Note: All deadlines are 11:59PM UTC-12:00 (“Anywhere on Earth”)

Submission Instructions

You are invited to submit your papers through our START/SoftConf submission portal. All submitted papers must be anonymous for double-blind review. The content of the paper should not be longer than 8 pages for long papers and 4 pages for short papers, strictly following the COLING 2025 templates, with the mandatory limitations section not counting towards the page limit. Supplementary materials and appendices (either as separate files or appended after the main submission) are allowed. We encourage code link submissions for reproducibility.

Non-archival Option

To promote discussions within the community, our workshop includes a non-archival track. Authors have the flexibility to submit their unpublished work or papers accepted to the COLING main conference to our workshop. The organisers may offer the opportunity to give an oral or poster presentation.

Invited Speakers

Timothy Baldwin

Professor Tim Baldwin is Provost and Professor of Natural Language Processing at Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), in addition to being a Melbourne Laureate Professor in the School of Computing and Information Systems, The University of Melbourne and Chief Scientist of LibrAI, a start-up focused on AI safety. Tim completed a BSc(CS/Maths) and BA(Linguistics/Japanese) at The University of Melbourne in 1995, and an MEng(CS) and PhD(CS) at the Tokyo Institute of Technology in 1998 and 2001, respectively. He joined MBZUAI at the start of 2022, prior to which he was based at The University of Melbourne for 17 years. His research has been funded by organisations including the Australian Research Council, Google, Microsoft, Xerox, ByteDance, SEEK, NTT, and Fujitsu. He is the author of over 500 peer-reviewed publications across diverse topics in natural language processing and AI, in addition to being an ARC Future Fellow, and the recipient of a number of awards at top conferences.

Qi Wu

Dr Qi Wu is an Associate Professor at the University of Adelaide and was an ARC Discovery Early Career Researcher Award (DECRA) Fellow from 2019 to 2021. He is the Director of Vision-and-Language at the Australian Institute for Machine Learning. The Australian Academy of Science awarded him a J G Russell Award in 2019. He obtained his PhD degree in 2015 and MSc degree in 2011, both in Computer Science, from the University of Bath, United Kingdom. His research interests are mainly in computer vision and machine learning. Currently, he is working on vision-language problems, with particular expertise in image captioning and visual question answering (VQA). He has published more than 100 papers in prestigious conferences and journals, such as TPAMI, CVPR, ICCV, and ECCV. He is also an Area Chair for CVPR and ICCV.

Yova Kementchedjhieva

Dr Yova Kementchedjhieva is an assistant professor of Natural Language Processing at the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI). Her research concerns language generation in multimodal and cross-lingual contexts. She is interested in knowledge grounding and transfer learning, most recently in the area of vision-and-language processing. Prior to joining MBZUAI, Kementchedjhieva was a postdoctoral researcher in the department of computer science at the University of Copenhagen. During her time at the University of Copenhagen, she worked on conditional text generation across a range of tasks, including grammatical error correction, dialog generation and image captioning. Her earlier work concerned multilingual natural language processing, with a focus on cross-lingual embedding alignment. While at Copenhagen, she also worked as a teaching assistant, gave lectures for beginner and advanced NLP courses, and interned at Google LLC and Dataminr in a researcher capacity.

Program Committee

Organisers