Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, including code generation. However, complex inductive reasoning, that is, deriving general rules from limited observations, remains a significant challenge. Programming-by-Examples (PBE), which aims to synthesize programs from input-output examples, is an important inductive reasoning task in programming languages with practical applications. We propose an approach that enhances LLMs on PBE through code-grounded synthetic data generation, which provides high-quality training data for finetuning and addresses the scarcity of domain-specific data. Furthermore, we demonstrate that scaling test-time computation significantly improves inference results in this PBE setting. Our approach achieves state-of-the-art results on common PBE benchmarks covering the string, number sequence, and logo graphics domains. We further extend our methods to ARC-AGI, a highly challenging benchmark that requires visual inductive reasoning from a few examples and involves concepts such as physics, objects, and symmetry. By applying our synthetic data and test-time scaling methods and combining them with transduction, we approach human-level performance on ARC-AGI, demonstrating the framework's effectiveness even in challenging, visually grounded domains.
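As a minimal illustration of this test-time scaling idea, the Python sketch below samples many candidate programs and keeps only those consistent with the given input-output examples. The sample_programs and run_program callables are hypothetical placeholders for an LLM sampling call and a sandboxed executor; they are assumptions for illustration, not components of our actual pipeline.

# Minimal sketch of example-based test-time scaling for PBE:
# sample many candidate programs, then keep those consistent with the
# provided input-output examples. `sample_programs` and `run_program`
# are hypothetical placeholders (an LLM sampling call and a sandboxed
# executor), not part of our system.
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # one (input, expected_output) pair

def solve_pbe(
    sample_programs: Callable[[List[Example], int], List[str]],
    run_program: Callable[[str, str], str],
    examples: List[Example],
    budget: int = 64,
) -> List[str]:
    """Return the sampled programs that satisfy every given example."""
    candidates = sample_programs(examples, budget)
    consistent = []
    for program in candidates:
        try:
            if all(run_program(program, x) == y for x, y in examples):
                consistent.append(program)
        except Exception:
            continue  # discard candidates that crash on the examples
    return consistent

Larger sampling budgets make it more likely that at least one consistent program is found, which is the sense in which additional test-time computation improves inference in this setting.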
Unlike PBE and ARC-AGI tasks, where examples enable direct validation, real-world code generation often begins with ambiguous natural language specifications. This inherent ambiguity creates uncertainty about whether generated code is correct. We develop an approach that samples both code and tests from LLMs and uses their execution results to build a classifier that estimates correctness probabilities. The method also produces human-interpretable predicates that explain code behavior, a feature users preferred in our user study, helping to make program synthesis more trustworthy while maintaining state-of-the-art accuracy.
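The sketch below shows, under assumed design choices, how cross-execution results between sampled programs and sampled tests can be turned into features for a simple correctness classifier. The feature construction and the logistic-regression model are illustrative assumptions, not our exact formulation.

# Hedged sketch: estimating code-correctness probability from execution
# results of sampled tests. Feature set and classifier choice are
# illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def execution_features(pass_matrix: np.ndarray) -> np.ndarray:
    """pass_matrix[i, j] = 1 if candidate program i passes sampled test j.
    Returns one feature vector per candidate program."""
    pass_rate = pass_matrix.mean(axis=1)      # fraction of sampled tests each program passes
    test_strength = pass_matrix.mean(axis=0)  # fraction of programs passing each test
    weighted_rate = pass_matrix @ test_strength / max(pass_matrix.shape[1], 1)
    return np.stack([pass_rate, weighted_rate], axis=1)

def fit_correctness_classifier(pass_matrix: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    """Fit on candidates whose correctness labels are known."""
    clf = LogisticRegression()
    clf.fit(execution_features(pass_matrix), labels)
    return clf

def correctness_probability(clf: LogisticRegression, pass_matrix: np.ndarray) -> np.ndarray:
    """Estimated probability that each candidate program is correct."""
    return clf.predict_proba(execution_features(pass_matrix))[:, 1]

In this sketch, the sampled tests play the role of interpretable predicates over program behavior: a user can inspect which tests a candidate passes or fails to understand why it received a given correctness estimate.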
