
Task description: SHROOM participants will need to detect grammatically sound output that contains incorrect semantic information (i.e. unsupported or inconsistent with the source input), with or without having access to the model that produced the output.

Overview of the task: The modern NLG landscape is plagued by two interlinked problems: on the one hand, our current neural models have a propensity to produce inaccurate but fluent outputs; on the other hand, our metrics are better at capturing fluency than correctness. As a result, neural networks “hallucinate”, i.e., produce fluent but incorrect outputs that we currently struggle to detect automatically. For many NLG applications, however, the correctness of an output is mission-critical. For instance, a plausible-sounding translation that is inconsistent with the source text jeopardizes the usefulness of a machine translation pipeline. With our shared task, we hope to foster the community's growing interest in this topic.

With SHROOM, we adopt a post hoc setting, where models have already been trained and outputs already produced. Participants will be asked to perform binary classification to identify cases of fluent overgeneration hallucinations in two tracks: model-aware and model-agnostic. That is, participants must detect grammatically sound outputs which contain incorrect or unsupported semantic information, inconsistent with the source input, with or without having access to the model that produced the output. To that end, we will provide participants with a collection of checkpoints, inputs, references and outputs of systems covering three different NLG tasks: definition modeling (DM), machine translation (MT) and paraphrase generation (PG), trained with varying degrees of accuracy. The development set will provide binary annotations from at least five different annotators and a majority vote gold label.
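As an illustration of how a majority vote gold label can be derived from several binary annotations, here is a minimal Python sketch. The label strings, the function name `majority_vote`, and the tie-breaking rule (ties resolve to the non-hallucination label) are assumptions made for this example; consult the README shipped with the data for the actual format and conventions.

```python
from collections import Counter

# Hypothetical label strings; the released data may use different values.
HALLUCINATION = "Hallucination"
NOT_HALLUCINATION = "Not Hallucination"


def majority_vote(annotations):
    """Aggregate binary annotations into a gold label.

    Returns a tuple (gold_label, share), where share is the fraction of
    annotators who voted HALLUCINATION.
    """
    if not annotations:
        raise ValueError("need at least one annotation")
    counts = Counter(annotations)
    share = counts[HALLUCINATION] / len(annotations)
    # Assumption: a strict majority is required; ties fall back to
    # NOT_HALLUCINATION. This choice is arbitrary and only illustrative.
    gold = HALLUCINATION if share > 0.5 else NOT_HALLUCINATION
    return gold, share


# Example with five annotators, three of whom flag the output.
votes = [
    HALLUCINATION,
    HALLUCINATION,
    NOT_HALLUCINATION,
    HALLUCINATION,
    NOT_HALLUCINATION,
]
gold, share = majority_vote(votes)
print(gold, share)
```

Keeping the vote share alongside the label preserves how strongly the annotators agreed, which can be useful when analyzing borderline items.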

Join the mailing group:

Codalab competition page

Follow us on Twitter

[TRAIN unlabeled] Download unlabeled training data - Google drive link (Updated November 8, 2023): unlabeled model-agnostic and model-aware training data (in v2 only the model-aware file has been updated).
NOTE: No labeled training data is provided. However, feel free to use any existing methods and/or datasets to develop your systems, such as Friel et al. 2023, Guerreiro et al. 2023, Li et al. 2023, and the EdinburghNLP GitHub repository.

[TRIAL] Download trial data - Google drive link (Updated August 2, 2023): Trial data including README file (in v1.1 only the README file has been updated).

[DEV] Download validation data - Google drive link (Updated November 3, 2023): Validation data including README file (in v2 only the README and the model-aware file have been updated).

[Baseline Kit] Download baseline participant kit - Google drive link (Updated December 19, 2023): baseline participant kit including README file.

[TEST labeled] Download labeled test data - Google drive link (Updated February 2, 2024): labeled model-agnostic and model-aware test data.

[Ranking Submission TEST] - Final ranking of all submissions on the test data (Updated February 2, 2024):

Important dates for task participants

Organizers of the shared task: