Surprisal and Crossword Clue Difficulty: Evaluating Linguistic Processing between LLMs and Humans
Asya Zanollo (Writing – Original Draft Preparation)
Achille Fusco (Writing – Original Draft Preparation)
Cristiano Chesi (Conceptualization)
2025-01-01
Abstract
Crossword clue difficulty is traditionally judged by human setters, leaving automated puzzle generators without an objective yardstick. We model difficulty as the Surprisal of the answer given the clue, estimating it with token probabilities from large language models. Comparing three causal LLMs (Llama-3-8B, Llama-2-7B, and Ita-GPT-2-121M) with 60 human solvers on 160 hand-balanced clues, we find that Surprisal correlates negatively with accuracy (r = –0.62 for nominal clues). These results show that language-model Surprisal captures some of the cognitive load humans experience and that language-specific training and model scale both matter; the metric therefore enables adaptive crossword generation and provides a new test-bed for probing the alignment between human and model linguistic processing.
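
To make the metric concrete, here is a minimal sketch of answer-given-clue Surprisal under a Hugging Face causal LM. The model name ("gpt2" as a stand-in), the prompt format, and the clue/answer pair are illustrative assumptions, not taken from the paper.

    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Stand-in model; the paper evaluates Llama-3-8B, Llama-2-7B, and Ita-GPT-2-121M.
    model_name = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    def answer_surprisal(clue: str, answer: str) -> float:
        """Surprisal of the answer given the clue, -log2 P(answer | clue),
        summed over the answer's tokens under a causal LM."""
        clue_ids = tokenizer(clue, return_tensors="pt").input_ids
        answer_ids = tokenizer(" " + answer, add_special_tokens=False,
                               return_tensors="pt").input_ids
        input_ids = torch.cat([clue_ids, answer_ids], dim=1)
        with torch.no_grad():
            log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)
        n_ans = answer_ids.shape[1]
        # Logits at position i predict the token at i+1, so each answer token
        # is scored from the position immediately preceding it.
        total_nats = 0.0
        for k in range(n_ans):
            pos = input_ids.shape[1] - n_ans - 1 + k
            tok = answer_ids[0, k]
            total_nats -= log_probs[0, pos, tok].item()
        return total_nats / math.log(2)  # convert nats to bits

    # Hypothetical clue/answer pair:
    print(answer_surprisal("Capital of Italy:", "Rome"))

Summed Surprisal grows with answer length in tokens; when comparing answers of different lengths, a per-token average may be the fairer variant.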


