The First Workshop of Evaluation of Multi-Modal Generation

Multimodal generation techniques have opened new avenues for creative content generation. However, evaluating the quality of multimodal generation remains underexplored, and key questions are still unanswered, such as the contribution of each modality, the utility of pre-trained large language models for multimodal generation, and how to measure faithfulness and fairness in multimodal outputs. This workshop aims to foster discussion and research by bringing together researchers and practitioners in natural language processing, computer vision, and multimodal AI. Our goal is to establish evaluation methods for multimodal research and to advance research efforts in this direction.

Call for Papers

Both long and short papers (up to 8 and 4 pages, respectively, with unlimited references and appendices) are welcome.

Topics relevant to this workshop include, but are not limited to:

Important Dates

Note: All deadlines are 11:59 PM UTC-12:00 (“Anywhere on Earth”).

Submission Instructions

Please submit your papers through our START/SoftConf submission portal. All submissions must be anonymized for double-blind review. The main content must not exceed 8 pages for long papers or 4 pages for short papers and must strictly follow the COLING 2025 templates; the mandatory Limitations section does not count towards the page limit. Supplementary material and appendices (either as separate files or appended after the main submission) are allowed. We encourage authors to include links to their code to support reproducibility.

Non-archival Option

To promote discussion within the community, our workshop includes a non-archival track. Authors may submit unpublished work or papers accepted to the COLING main conference to this track. The organisers may invite authors to give an oral or poster presentation.

Invited Speakers

Timothy Baldwin

Professor Tim Baldwin is Provost and Professor of Natural Language Processing at Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), in addition to being a Melbourne Laureate Professor in the School of Computing and Information Systems, The University of Melbourne, and Chief Scientist of LibrAI, a start-up focused on AI safety. Tim completed a BSc (CS/Maths) and a BA (Linguistics/Japanese) at The University of Melbourne in 1995, and an MEng (CS) and PhD (CS) at the Tokyo Institute of Technology in 1998 and 2001, respectively. He joined MBZUAI at the start of 2022, prior to which he was based at The University of Melbourne for 17 years. His research has been funded by organisations including the Australian Research Council, Google, Microsoft, Xerox, ByteDance, SEEK, NTT, and Fujitsu. He is the author of over 500 peer-reviewed publications across diverse topics in natural language processing and AI, an ARC Future Fellow, and the recipient of a number of awards at top conferences.

Qi Wu

Dr Qi Wu is an Associate Professor at the University of Adelaide and was an ARC Discovery Early Career Researcher Award (DECRA) Fellow from 2019 to 2021. He is the Director of Vision-and-Language at the Australian Institute for Machine Learning. The Australian Academy of Science awarded him a J G Russell Award in 2019. He obtained his MSc in 2011 and his PhD in 2015, both in Computer Science from the University of Bath, United Kingdom. His research interests lie mainly in computer vision and machine learning. He currently works on vision-language problems, with particular expertise in image captioning and visual question answering (VQA). He has published more than 100 papers in prestigious conferences and journals, such as TPAMI, CVPR, ICCV, and ECCV. He also serves as an Area Chair for CVPR and ICCV.

Yova Kementchedjhieva

Dr Yova Kementchedjhieva is an Assistant Professor of Natural Language Processing at the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI). Her research concerns language generation in multimodal and cross-lingual contexts. She is interested in knowledge grounding and transfer learning, most recently in the area of vision-and-language processing. Prior to joining MBZUAI, Kementchedjhieva was a postdoctoral researcher in the Department of Computer Science at the University of Copenhagen, where she worked on conditional text generation across a range of tasks, including grammatical error correction, dialogue generation, and image captioning. Her earlier work concerned multilingual natural language processing, with a focus on cross-lingual embedding alignment. While at Copenhagen, she also worked as a teaching assistant, gave lectures for beginner and advanced NLP courses, and held research internships at Google LLC and Dataminr.

Organisers