Title: “Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks” (presented at NAACL)
An MIT CSAIL team evaluated LLMs on counterfactual versions of tasks, i.e., slightly altered variants of familiar tasks, rather than on the standard test sets.
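To make the setup concrete: one of the counterfactual conditions in the paper is arithmetic in an unfamiliar base (such as base 9) instead of the familiar base 10. Below is a minimal sketch of how such default/counterfactual prompt pairs could be generated; the prompt wording and helper names (`to_base`, `addition_prompt`) are illustrative assumptions, not the paper's exact templates.

```python
def to_base(n: int, base: int) -> str:
    """Render a non-negative integer in the given base (2-10, digits 0-9 only)."""
    if n == 0:
        return "0"
    digits = []
    while n:
        n, r = divmod(n, base)
        digits.append(str(r))
    return "".join(reversed(digits))

def addition_prompt(a: int, b: int, base: int) -> tuple[str, str]:
    """Return (prompt, expected_answer) for a two-operand addition in `base`.

    Hypothetical prompt template, used only to illustrate the default vs.
    counterfactual pairing described in the paper.
    """
    prompt = (
        f"You are doing arithmetic in base-{base}. "
        f"What is {to_base(a, base)} + {to_base(b, base)}? Answer in base-{base}."
    )
    return prompt, to_base(a + b, base)

if __name__ == "__main__":
    a, b = 27, 58
    print(addition_prompt(a, b, base=10))  # familiar (default) condition
    print(addition_prompt(a, b, base=9))   # unfamiliar (counterfactual) condition
```

The key property is that both prompts require the same underlying skill (multi-digit addition), so a model that genuinely reasons about the task should handle both, while a model that mostly recites memorized base-10 patterns will not.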
They found that while the models perform well on the standard (familiar) task instances, their performance drops significantly on the counterfactual variants (unfamiliar examples). From the MIT news article:
“…much of their performance on the standard tasks is likely not due to general task abilities, but over-fitting to, or directly memorizing from, what they have seen in their training data.”
The overarching conclusion: these LLMs excel in familiar scenarios, but their ability to generalize to truly novel ones is far more limited than often assumed.
Since you noted that LLMs “almost always give from their training data instead of creating something fresh,” this research aligns closely with your observation.