Language model assessment through linguistically motivated contrasts
Neri, Sofia (Validation); Rossi, Sarah (Data Curation); Chesi, Cristiano (Writing – Review & Editing)
2026-01-01
Abstract
We present BLiMP-IT, a linguistically informed benchmark to assess the performance of Italian Language Models (LMs). Inspired by state-of-the-art tools for LM evaluation and informed both by generative theorizing and psycholinguistic metrics, the benchmark tests a rich variety of structures using minimal pair contrasts, i.e., a grammatical sentence and an ungrammatical one that differ minimally with respect to a single morphosyntactic property. By prompting the model to assign a probability to each sentence in a pair, BLiMP-IT tests LMs' accuracy, as well as their ability to reach linguistically meaningful generalizations, ultimately offering insights into human-machine comparability and the validity of the Poverty of the Stimulus hypothesis.
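The minimal-pair evaluation described above can be sketched as follows. This is a hedged illustration, not the benchmark's actual implementation: `score` is a hypothetical stand-in for a real LM's sentence-level log-probability, and the toy Italian pair is invented for illustration (it is not taken from BLiMP-IT).

```python
# Minimal sketch of minimal-pair evaluation (illustrative only).
# score() is a hypothetical stand-in for an LM's log-probability;
# the benchmark's real scoring procedure may differ.

def score(sentence, log_probs):
    # Look up a precomputed sentence-level log-probability (assumed given).
    return log_probs[sentence]

def minimal_pair_accuracy(pairs, log_probs):
    # A pair counts as correct when the grammatical sentence receives
    # a higher log-probability than its ungrammatical counterpart.
    correct = sum(
        1 for good, bad in pairs
        if score(good, log_probs) > score(bad, log_probs)
    )
    return correct / len(pairs)

# Toy data (invented, not from the benchmark): a number-agreement pair.
log_probs = {
    "Il gatto dorme.": -5.2,    # grammatical
    "Il gatto dormono.": -9.8,  # ungrammatical
}
pairs = [("Il gatto dorme.", "Il gatto dormono.")]
print(minimal_pair_accuracy(pairs, log_probs))  # → 1.0
```

Accuracy here is simply the fraction of pairs in which the model prefers the grammatical member, which is the standard way minimal-pair benchmarks in the BLiMP family are scored.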


