Recurrent Networks are (Linguistically) Better? An Experiment on Small-LM Training on Child-Directed Speech in Italian

Achille Fusco (Writing – Original Draft Preparation)
Matilde Barbini (Writing – Review & Editing)
Maria Letizia Piccini Bianchessi (Writing – Review & Editing)
Veronica Bressan (Writing – Review & Editing)
Sofia Neri (Writing – Review & Editing)
Sarah Rossi (Writing – Review & Editing)
Tommaso Sgrizzi (Writing – Review & Editing)
Cristiano Chesi (Writing – Original Draft Preparation)
2024-01-01

Abstract

Here we discuss the strategies and results of a small-scale training program based on Italian child-directed speech (fewer than 3M tokens) for various network architectures. The rationale behind these experiments [1] is to understand the effect of this naturalistic training diet on different model architectures. Our preliminary findings lead us to conclude that: (i) different tokenization strategies produce only mildly significant improvements overall, although the resulting segmentation aligns more closely with linguistic intuitions in some cases than in others; (ii) modified LSTM networks (the eMG-RNN variant), with a single layer and a structurally more controlled cell state, achieve slightly worse training loss than standard one- and two-layer LSTM models but perform better on linguistically critical contrasts. This suggests that standard loss/accuracy metrics in autoregressive training are linguistically irrelevant and, more generally, misleading, since the models with the best training metrics produce poorer linguistic predictions ([2], pace [3]). Overall, the performance of these models remains significantly below that of 7-year-old native-speaking children on the linguistic contrasts we considered [4].
Keywords: LSTM, Transformers, Small Language Models (SLM), tokenization, cell state control, LM evaluation
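
The abstract's "structurally more controlled cell state" is not spelled out in this record, so the sketch below is only a rough illustration of the general idea, not the eMG-RNN architecture itself: a single-layer LSTM language model in PyTorch whose cell-state update is modulated by one extra, hypothetical write gate. Class and parameter names (GatedCellLM, write_gate, d_emb, d_hid) are assumptions made for the example.

```python
import torch
import torch.nn as nn

class GatedCellLM(nn.Module):
    """Single-layer LSTM LM with an extra write gate on the cell state.
    Illustrative only: NOT the eMG-RNN variant described in the paper."""

    def __init__(self, vocab_size, d_emb=128, d_hid=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_emb)
        self.cell = nn.LSTMCell(d_emb, d_hid)
        # hypothetical extra gate deciding how much of the LSTM's
        # proposed cell-state change is actually written to memory
        self.write_gate = nn.Linear(d_emb + d_hid, d_hid)
        self.out = nn.Linear(d_hid, vocab_size)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        B, _ = tokens.shape
        h = torch.zeros(B, self.cell.hidden_size, device=tokens.device)
        c = torch.zeros_like(h)
        logits = []
        for t in range(tokens.size(1)):
            x = self.emb(tokens[:, t])
            h_new, c_new = self.cell(x, (h, c))
            # restrict the update: only a gated fraction of the change
            # proposed by the LSTM reaches the cell state
            g = torch.sigmoid(self.write_gate(torch.cat([x, h], dim=-1)))
            c = c + g * (c_new - c)
            h = h_new
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)           # (batch, seq_len, vocab)
```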
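
The abstract also contrasts training loss with performance on "linguistically critical contrasts". A common way to score such contrasts, assumed here rather than taken from the paper, is to check whether the model assigns a higher log-probability to the grammatical member of each minimal pair. A minimal sketch, assuming an autoregressive model that returns per-position logits as in the class above:

```python
import torch
import torch.nn.functional as F

def sentence_logprob(model, token_ids):
    """Summed next-token log-probability of a tokenized sentence under an
    autoregressive LM that returns (batch, seq_len, vocab) logits."""
    with torch.no_grad():
        logits = model(token_ids[:, :-1])            # predict positions 1..T-1
        logps = F.log_softmax(logits, dim=-1)
        targets = token_ids[:, 1:].unsqueeze(-1)
        return logps.gather(-1, targets).squeeze(-1).sum(dim=-1)

def contrast_accuracy(model, minimal_pairs):
    """Fraction of (grammatical, ungrammatical) pairs where the
    grammatical variant receives the higher log-probability."""
    hits = 0
    for good_ids, bad_ids in minimal_pairs:          # each: (1, seq_len) LongTensor
        hits += int((sentence_logprob(model, good_ids)
                     > sentence_logprob(model, bad_ids)).item())
    return hits / len(minimal_pairs)
```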

Use this identifier to cite or link to this item: https://hdl.handle.net/20.500.12076/18997