We introduce PRefLexOR (Preference-based Recursive Language Modeling for Exploratory Optimization of Reasoning), a framework that integrates preference optimization with reinforcement learning (RL) concepts to enable self-improving scientific reasoning. PRefLexOR takes a recursive approach, refining intermediate reasoning steps before producing final outputs during both training and inference. It optimizes the log odds between preferred and non-preferred responses using an in-situ dataset generation algorithm, while a dynamic knowledge graph contextualizes reasoning with retrieval-augmented data. Preference optimization enhances performance via rejection sampling, with reasoning steps masked so that training focuses on discovery. Recursive optimization, guided by feedback loops, further refines the reasoning process, mirroring biological adaptation and enabling real-time learning. We find that even small models (3 billion parameters) can self-teach deeper reasoning and solve open-domain problems effectively. Our method integrates into existing LLMs and demonstrates success in biological materials science, leveraging multi-agent self-improvement for greater reasoning depth and cross-domain adaptability, and offering flexibility for integration into larger agentic systems.
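The abstract's core training signal is a preference term on the log odds of preferred versus non-preferred responses. A minimal sketch of such an odds-ratio objective is shown below; this is an illustration assuming an ORPO-style formulation, not the paper's exact loss, and the function names and the mean per-token log-probability inputs are hypothetical:

```python
import math

def log_odds(logprob_per_token: float) -> float:
    """Log-odds of a response from its mean per-token log-probability p:
    log(p / (1 - p)), computed from log p for numerical clarity."""
    p = math.exp(logprob_per_token)  # p in (0, 1) since log-probs are negative
    return math.log(p) - math.log(1.0 - p)

def preference_loss(logp_chosen: float, logp_rejected: float) -> float:
    """Odds-ratio preference term: -log sigmoid(log-odds difference).
    Small when the preferred response is much more likely than the
    non-preferred one; log(2) when the two are equally likely."""
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-ratio)))
```

In a full training loop this term would be added to the standard language-modeling loss on the preferred response, so that the model is simultaneously pulled toward preferred reasoning traces and pushed away from rejected ones.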
Details
Subject terms: Partial differential equations; Materials science; Datasets; Biological materials; Artificial intelligence; Modelling; Optimization techniques; Knowledge; Feedback loops; Interdisciplinary aspects; Reasoning; Optimization; Preferences; Biomedical materials; Multiagent systems; Machine learning; Biological activity; Informatics; Real time; Large language models; Knowledge representation; Natural language; Recursive methods
1 Massachusetts Institute of Technology, Center for Computational Science and Engineering, Schwarzman College of Computing, Laboratory for Atomistic and Molecular Mechanics (LAMM), Cambridge, USA (GRID:grid.116068.8) (ISNI:0000 0001 2341 2786)