Structural sensitivity does not entail grammaticality: assessing LLMs against the Universal Functional Hierarchy
Tommaso Sgrizzi (Writing – Original Draft Preparation); Asya Zanollo (Writing – Original Draft Preparation); Cristiano Chesi (Supervision)
2025-01-01
Abstract
This paper investigates whether large language models (LLMs) generalize core syntactic properties associated with restructuring verbs in Italian, a domain tied to the universal hierarchy of functional heads proposed by Cinque [1, 2]. Specifically, we examine whether LLMs distinguish between restructuring and control verbs based on canonical syntactic diagnostics: verb ordering, clitic climbing, and auxiliary selection. We also probe how models interpret novel infinitive-selecting pseudoverbs, testing whether they default to restructuring- or control-like behavior. Using controlled minimal pairs, we evaluate five models of different sizes: Minerva-7B-base-v1.0 [3], GPT2-medium-italian-embeddings [4], Bert-base-italian-xxl-uncased [5], GPT2-small-italian [4], and GePpeTto [6]. Our findings reveal that none of the models internalizes the functional hierarchy, systematically blocks clitic climbing with control verbs, or is sensitive to the auxiliary selection variability that distinguishes the restructuring and control classes. These results highlight fundamental limitations in the syntactic generalization abilities of current LLMs, particularly in domains where structural contrasts are not overtly marked in the input.
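As an illustration of the minimal-pair methodology the abstract describes, the sketch below shows one standard way such pairs are scored with a causal LM: summing per-token log-probabilities and comparing the two sentence scores. The HuggingFace checkpoint identifier, the helper function, and the clitic-climbing example sentences are our own illustrative assumptions, not the paper's released code or actual stimuli.

```python
# Minimal sketch of minimal-pair scoring with a causal LM.
# Checkpoint ID and sentences are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "GroNLP/gpt2-small-italian"  # assumed checkpoint for GPT2-small-italian
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Sum of log P(token_i | tokens_<i) over the whole sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Logits at position i predict token i+1, so shift by one.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    return log_probs.gather(1, targets.unsqueeze(1)).sum().item()

# Illustrative clitic-climbing pair: the restructuring verb 'volere'
# licenses climbing; the control verb 'detestare' should block it.
pair = ("Lo voglio comprare.",   # clitic climbed over a restructuring verb
        "Lo detesto comprare.")  # climbing over a control verb (ungrammatical)
for s in pair:
    print(f"{sentence_logprob(s):8.2f}  {s}")
```

A model sensitive to the restructuring/control contrast should assign the first sentence a higher (less negative) log-probability than the second; the paper's finding is that the evaluated models do not do so systematically.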


