shroom

NEW

[Oct 10th] Make sure to use the updated HI, ML, TE, and BN test data files!

[Oct 5th] TEST SET IS OUT!!!!!

Submission platform: Participants can submit their solutions and view the leaderboard at https://shroomcap.pythonanywhere.com/. Go register now!

SHROOM-CAP

Welcome to the official shared task website for SHROOM-CAP, a CHOMPS 2025 shared task!

SHROOM-CAP stands for “Shared-task on Hallucinations and Related Observable Overgeneration Mistakes in Crosslingual Analyses of Publications”. SHROOM-CAP will invite participants to detect hallucination in the outputs of LLMs in a scientific context. This shared task extends our previous iteration, SHROOM, with a few key changes:

We focus on LLM outputs in the scientific domain;
We look at a crosslingual setting with both high-resource languages, such as English, Spanish, French, and Hindi, and surprise low-resource languages;
Participants will have to detect if hallucination occurs or not.

The information on this website is subject to change. We will send announcements for any major update on the Google group mailing list.

What is SHROOM-CAP?

The task consists of detecting the presence of scientific hallucinations. Participants are asked to determine whether a given scientific text produced by an LLM contains hallucinations. The task is held in a cross-lingual setting, i.e., we provide data in multiple languages, produced by a variety of public-weights LLMs.

In practice, we provide an LLM output (as a string of characters, a list of tokens, and a list of logits), and participants have to predict if the LLM output string contains a hallucination (binary classification).

Participants are free to use any approach they deem appropriate, including using external resources, and work on any subset of languages they are interested in.
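As a rough illustration of the input and output described above, the sketch below models one data point and a toy predictor. The field names (`text`, `tokens`, `logits`, `lang`), the assumption of one logit per token, and the thresholding heuristic are all hypothetical; check the released files for the actual keys, and treat this as a placeholder rather than an official baseline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    # Hypothetical field names; the released data may use different keys.
    text: str            # LLM output as a string of characters
    tokens: List[str]    # LLM output as a list of tokens
    logits: List[float]  # assumed here: one logit per token
    lang: str            # language code, e.g. "en" or "hi"


def predict_hallucination(example: Example, threshold: float = 0.0) -> int:
    """Toy heuristic: return 1 (hallucination) when the mean logit is below a threshold."""
    if not example.logits:
        return 0  # nothing to score, so default to "no hallucination"
    mean_logit = sum(example.logits) / len(example.logits)
    return int(mean_logit < threshold)
```

Any real system would replace the heuristic with a trained classifier or an approach based on external resources; only the binary output (0 or 1 per LLM output) is dictated by the task description.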

How will participants be evaluated?

Participants will be evaluated on binary classification of scientific hallucinations. Performance will be measured with the macro-F1 score on two criteria: (i) factual mistakes and (ii) fluency mistakes.

Rankings and submissions will be done separately per language.
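To make the metric concrete, here is a minimal sketch of macro-F1 for binary labels, computed separately per language. It is not the official scorer: the function names, the `(lang, gold, pred)` row layout, and the choice to score a class as 0.0 when it has no true positives are assumptions. The same computation would be run once for the factuality labels and once for the fluency labels.

```python
from collections import defaultdict


def f1_for_class(gold, pred, cls):
    """F1 for a single class, treated as the positive label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    if tp == 0:
        return 0.0  # assumption: no true positives means F1 = 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)


def macro_f1_per_language(rows):
    """rows: iterable of (lang, gold_label, predicted_label) triples."""
    by_lang = defaultdict(lambda: ([], []))
    for lang, gold, pred in rows:
        by_lang[lang][0].append(gold)
        by_lang[lang][1].append(pred)
    return {lang: macro_f1(g, p) for lang, (g, p) in by_lang.items()}
```

Because macro-F1 averages the two classes without weighting, a system that always predicts the majority class scores poorly on the minority class and is penalized accordingly.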

Participant info

To participate, register via https://forms.gle/hWR9jwTBjZQmFKAE7. This form enables us to add participants to the Google group for further communication.

Data

Below are links to access the data already released, as well as provisional expected release dates for future splits. Do note that release dates are subject to change.

Dataset split	Access	Description
Train Set v1	download (train1)	Contains languages: `en, hi, es, fr, it`
Sample Testing data	download (test-sample)	Shows the format of the test set
Validation Set	download (validation)	Contains languages: `en, hi, es, fr, it`
Train Set v2	download (train2)	Contains languages: `en, hi, es, fr, it`
Test Set	download (unlabeled test)	Contains languages: `en, hi, es, fr, it`; Indic languages: `bn, te, ml, gu`

Important dates

This information is subject to change.

~~Starter Release – July 28~~
~~Training Phase July 28 – October 5, 2025~~
~~Testing Phase October 5 – October 16, 2025~~
~~Paper Submission Deadline October 25, 2025~~
Notification of Acceptance ~~November 3, 2025~~ November 6, 2025
Camera-ready Due November 11, 2025
Proceedings Due December 1, 2025
CHOMPS workshop: 23rd December 2025 (co-located with AACL 2025)

**All deadlines are at 00h00 GMT.**

Final Leaderboard:

Factuality + Per-Language Rank

Team	BN	EN	ES	FR	GU	HI	IT	ML	TE
smurfcat	0.6913 (1)	0.8627 (2)	0.7592 (1)	0.8595 (1)	0.6413 (1)	0.8364 (1)	0.8700 (1)	0.6487 (1)	0.7164 (1)
nsu-ai	0.5251 (2)	0.5116 (5)	0.5354 (3)	0.6612 (2)	0.5032 (2)	0.4771 (3)	0.7418 (2)	0.5220 (2)	0.5004 (3)
baseline	0.4320 (3)	0.5266 (4)	0.5153 (4)	0.4819 (3)	0.4796 (3)	0.4401 (4)	0.4861 (3)	0.5428 (3)	0.5012 (2)
CUET_Goodfellas	—	0.6483 (3)	0.7243 (2)	—	—	—	—	—	—
medusa	—	0.9191 (1)	—	—	—	—	—	—	—
Scalar_nitk	—	—	—	—	—	0.5449 (2)	—	—	—

Fluency + Per-Language Rank

Team	BN	EN	ES	FR	GU	HI	IT	ML	TE
smurfcat	0.7430 (1)	0.7000 (1)	0.6382 (1)	0.8516 (1)	0.6735 (1)	0.8773 (1)	0.6325 (1)	0.7398 (1)	0.8905 (1)
nsu-ai	0.7079 (2)	0.6118 (3)	0.5282 (3)	0.5206 (2)	0.5572 (2)	0.7536 (3)	0.5019 (2)	0.6964 (2)	0.4027 (3)
baseline	0.4845 (3)	0.4413 (5)	0.4306 (4)	0.4873 (3)	0.4938 (3)	0.4119 (4)	0.4656 (3)	0.4983 (3)	0.4657 (2)
CUET_Goodfellas	—	0.5488 (4)	0.5913 (2)	—	—	—	—	—	—
Scalar_nitk	—	—	—	—	—	0.8349 (2)	—	—	—
medusa	—	0.6253 (2)	—	—	—	—	—	—	—
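The per-language ranks in parentheses are consistent with sorting teams by descending macro-F1 within each language column. The sketch below reproduces the EN column of the factuality table that way; the function name is illustrative, and tie handling is left unspecified because no ties occur in these scores.

```python
def rank_by_score(scores):
    """Map each team to its rank, with rank 1 for the highest score."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return {team: position for position, (team, _) in enumerate(ordered, start=1)}


# EN factuality scores copied from the leaderboard above.
en_factuality = {
    "smurfcat": 0.8627,
    "nsu-ai": 0.5116,
    "baseline": 0.5266,
    "CUET_Goodfellas": 0.6483,
    "medusa": 0.9191,
}

ranks = rank_by_score(en_factuality)
```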

Organizers of the shared task

Aman Sinha, Université de Lorraine, France
Federica Gamba, Charles University, Prague
Raúl Vázquez, University of Helsinki, Finland
Timothee Mickus, University of Helsinki, Finland
Laura Zanella, Independent Researcher
Ahana Chattopadhyay, Orange Research, France
Yash Kankanampati, Université Sorbonne Paris Nord, France
Binesh Arakkal Remesh, Université de Lorraine, France
Aryan Chandramania, IIIT Hyderabad, India

Looking for something else?

The websites for all the iterations of the shared task are available here:
