Structural sensitivity does not entail grammaticality: assessing LLMs against the Universal Functional Hierarchy

Tommaso Sgrizzi (Writing – Original Draft Preparation)
Asya Zanollo (Writing – Original Draft Preparation)
Cristiano Chesi (Supervision)
2025-01-01

Abstract

This paper investigates whether large language models (LLMs) generalize core syntactic properties associated with restructuring verbs in Italian, a domain tied to the universal hierarchy of functional heads proposed by Cinque [1, 2]. Specifically, we examine whether LLMs distinguish between restructuring and control verbs based on canonical syntactic diagnostics: verb ordering, clitic climbing, and auxiliary selection. We also probe how models interpret novel infinitive-selecting pseudoverbs, testing whether they default to restructuring- or control-like behavior. Using controlled minimal pairs, we evaluate five models of different sizes: Minerva-7B-base-v1.0 [3], GPT2-medium-italian-embeddings [4], Bert-base-italian-xxl-uncased [5], GPT2-small-italian [4], and GePpeTto [6]. Our findings reveal that none of the models internalize the functional hierarchy, systematically block clitic climbing with control verbs, or show sensitivity to the auxiliary selection variability of the restructuring and control classes. These results highlight fundamental limitations in the syntactic generalization abilities of current LLMs, particularly in domains where structural contrasts are not overtly marked in the input.
Keywords: Large language models (LLMs), Cognitive plausibility, Syntactic evaluation, Universal hierarchy of functional heads, Restructuring verbs

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12076/22482