Large Language Models Under Evaluation: An Acceptability, Complexity And Coherence Assessment In Italian

Cristiano Chesi
Writing – Original Draft Preparation
2025-01-01

Abstract

This paper discusses the results of various experiments assessing the morphosyntactic and semantic competence in Italian of four very large language models (vLLMs): davinci (GPT-3/ChatGPT), davinci-002, davinci-003 (both GPT-3.5 models) and gpt-4-1106-preview (GPT-4). We evaluated these models on (i) acceptability, (ii) complexity, and (iii) coherence judgments using 7-point Likert scales, and on (iv) syntactic development through a forced-choice task. The test sets were drawn from shared NLP tasks and standard linguistic assessments. The results suggest that, although fine-tuned transformers outperform all GPT models, GPT-4 represents a significant improvement over third-generation GPT models. According to our tests, although GPT-4 and fine-tuned transformers cannot be considered descriptively or explanatorily adequate, they nonetheless pose a challenge to the poverty of the stimulus hypothesis. The "theory" expressed by GPT models is not linguistically intelligible in any relevant sense, and their training data is orders of magnitude larger than the primary linguistic input available to children. Nevertheless, GPT-4 captures certain generalizations, such as the constraints blocking the insertion of an overt resumptive clitic in specific gap positions, that are arguably unlearnable from primary positive data alone.
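The elicitation protocol summarized above (a 7-point Likert acceptability judgment requested from a GPT model) could be sketched as follows. The prompt wording, parsing logic, and the stand-in model function are illustrative assumptions for exposition, not the paper's actual experimental setup.

```python
import re

# Hypothetical Likert-scale elicitation sketch: ask the model to rate a
# sentence's acceptability from 1 to 7, then parse the numeric rating from
# the reply. Prompt text and parsing are assumptions, not the authors' code.

LIKERT_PROMPT = (
    "Rate the acceptability of the following Italian sentence on a scale "
    "from 1 (completely unacceptable) to 7 (fully acceptable). "
    "Reply with a single number.\n\nSentence: {sentence}\nRating:"
)

def parse_likert(reply: str):
    """Extract the first digit in the 1-7 range from a reply, or None."""
    match = re.search(r"[1-7]", reply)
    return int(match.group()) if match else None

def judge(sentence: str, ask_model):
    """Elicit a 7-point acceptability rating via a caller-supplied model."""
    return parse_likert(ask_model(LIKERT_PROMPT.format(sentence=sentence)))

# Usage with a stand-in model (a real run would call an LLM API here):
fake_model = lambda prompt: "I would rate this sentence a 6 out of 7."
print(judge("Il libro che ho letto ieri era bello.", fake_model))  # → 6
```

Keeping the model call behind a caller-supplied function makes the same harness reusable across the four GPT variants under comparison.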
LLM evaluation, GPT-3, GPT-4, poverty of stimulus, language acquisition, syntactic complexity, semantic coherence
Files for this record:
No files are associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12076/23778
Note: the data displayed have not been validated by the university.
