Test-time training (TTT) significantly enhances language models' abstract reasoning, improving accuracy by up to 6x on the Abstraction and Reasoning Corpus (ARC). Key factors for successful TTT include initial fine-tuning, auxiliary tasks, and per-instance training. Applying TTT to an 8B-parameter model boosts accuracy to 53% on ARC's public validation set, nearly 25% better than previous public, neural approaches. Ensembling with recent program generation methods achieves 61.9% accuracy, matching the average human score. This suggests that, in addition to explicit symbolic search, test-time training on few-shot examples significantly improves abstract reasoning in neural language models.
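The core idea of per-instance TTT can be illustrated with a toy sketch: for each test task, clone the base model's parameters, fine-tune that copy on the task's own few-shot demonstrations, and only then predict. This is a minimal illustration, not the paper's actual setup — a one-parameter linear model with squared-error gradient descent stands in for the 8B LLM, and the names (`ttt_predict`, `demos`) are invented for the example:

```python
def ttt_predict(base_weight, demos, query, steps=20, lr=0.1):
    """Per-instance test-time training (toy sketch):
    fine-tune a fresh copy of the model on this task's few-shot
    demonstrations, then answer the query with the adapted copy.
    A 1-D linear model y = w * x stands in for a language model."""
    w = float(base_weight)      # copy: never mutate the shared base model
    for _ in range(steps):
        for x, y in demos:      # the task's own input/output pairs
            pred = w * x
            grad = 2 * (pred - y) * x   # d/dw of squared error
            w -= lr * grad
    return w * query            # predict with the task-adapted weights

# Each task gets its own fresh fine-tune starting from the same base.
base = 0.0
task = [(1, 3), (2, 6)]        # hidden rule for this task: y = 3x
print(round(ttt_predict(base, task, 10)))  # recovers the rule: 30
```

Because adaptation starts from `base` each time, what is learned for one task never leaks into the next — the per-instance aspect the abstract highlights.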

  • SocialistDovahkiin [she/her]
    ·
    1 month ago

Achieving similar statistical accuracy by training on large datasets that probably contain the answers to a lot of the parts of these benchmarks doesn't seem too impressive.

    • AtmosphericRiversCuomo [none/use name]
      hexagon
      ·
      1 month ago

      The training datasets don't have the answers because the benchmark is diverse enough. That's why other models struggled to perform as well as humans until they applied the approach outlined in the paper. This is the benchmark: https://liusida.github.io/ARC/