We introduce PRefLexOR (Preference-based Recursive Language Modeling for Exploratory Optimization of Reasoning), a framework that integrates preference optimization with reinforcement learning (RL) concepts for self-improving scientific reasoning. PRefLexOR employs a recursive approach, refining intermediate steps before producing final outputs in training and inference. It optimizes log odds between preferred and non-preferred responses using an in-situ dataset generation algorithm. A dynamic knowledge graph contextualizes reasoning with retrieval-augmented data. Preference optimization enhances performance via rejection sampling, masking reasoning steps to focus on discovery. Recursive optimization, guided by feedback loops, refines reasoning. This process mirrors biological adaptation, enabling real-time learning. We find that even small models (3B parameters) self-teach deeper reasoning, solving open-domain problems effectively. Our method integrates into existing LLMs and demonstrates success in biological materials science, leveraging multi-agent self-improvement for enhanced reasoning depth and cross-domain adaptability, offering flexibility and integration into larger agentic systems.
Introduction
Generative artificial intelligence (AI) models, such as large language models (LLMs) and variants1–6, have not only impacted the landscape of natural language processing (NLP) but also unlocked the potential for scientifically focused models that may ultimately be able to reason, think, and generate insight across an unparalleled range of disciplines. From general-purpose tasks to highly specialized domains like materials science and engineering7–15, a challenge remains to develop strategies that yield more sophisticated scientific reasoning models capable of performing more difficult tasks.
Earlier work has made progress towards that goal, for example by teaching LLMs to reason not simply through brute force or rote memorization, but by leveraging structured approaches that mimic human thought processes. Chain-of-thought prompting16, for instance, guides models to break complex problems into clear, manageable steps, mimicking the logical progression that human minds follow when faced with a challenging task. Similarly, few-shot learning methods17 give models the ability to handle new tasks with minimal examples, enabling them to adapt their capabilities to novel scenarios.
Yet, applying these powerful models in technical fields like biomateriomics18,19 presents unique challenges. The intricacies of biomaterials design—where insights are drawn from multiscale, cross-disciplinary knowledge—require LLMs to go beyond surface-level understanding. In biomateriomics, researchers seek to explore and model biological systems at different scales, identifying how nature’s building blocks can inspire new materials7,19–24. Models of synthetic intelligence that capture scientific processes used in the analysis of such systems should offer a coherent and integrative strategy for solving cross-disciplinary problems, making them indispensable tools in fields like biomaterials research, where the ability to think, reason, and innovate is crucial. We posit that such advances can be achieved by developing models that can achieve several key objectives, including the ability to ingest rich, diverse, and disparate information from varied sources by forming rigorous internal knowledge representations that can be used to predict actionable outcomes (Fig. 1a). To reach this goal, models need to be developed that go beyond conventional predictions without situational awareness (Fig. 1b) towards more sophisticated models that encompass a higher degree of situational awareness, realized through capabilities of self-reflection, error correction, and exploration of a wide solution space to predict novel solutions (Fig. 1c).
[See PDF for image]
Fig. 1
Illustration of the workflow and design principles behind generative materials informatics.
a The process of transforming information into knowledge and actionable outcomes. Each individual piece of information (left) is synthesized into a network of interconnected knowledge, leading to informed decisions and innovative designs (right). b Conventional approaches in materials science rely on data-driven models, partial differential equations (PDEs), and experimental results, focusing on single-step predictions. c In contrast, generative materials informatics models built on the PRefLexOR framework proposed in this paper use “thinking” and “reflection” explicitly by incorporating iterative reasoning and contextual understanding, allowing for more complex, multi-step predictions. This approach expands from single inference steps, includes multiple modalities of data and responses, integrates real-world feedback and physics, and leverages self-assessment and self-learning. Using reinforcement learning (RL) principles, the discovery of principles or the solution of specific tasks is further inspired by biological paradigms, using bio-inspired neural network designs. These advanced methods support continuous improvement in material predictions, enabling more adaptable and intelligent designs.
Traditional LLMs have shown a certain level of proficiency in generating text, answering questions, and handling a wide range of natural language tasks. However, their capabilities remain limited, especially when it comes to reflecting on complex tasks or iterating over ideas to refine thought processes, and such abilities are often only achieved in very large frontier models.
In many current AI systems, especially in scientific applications, reasoning often follows a single-pass approach, where the model generates outputs without reflecting on the steps that led to its conclusions. This leads to challenges in solving open-domain or multi-step problems where deeper cognitive reflection is required. Furthermore, the lack of flexibility in reasoning, adaptation to new challenges, and real-time learning means that these models struggle to handle tasks that require evolving, recursive reasoning strategies.
To address these challenges, we propose PRefLexOR (Preference-based Recursive Language Modeling for Exploratory Optimization of Reasoning), a framework that combines preference optimization with recursive reasoning inspired by Reinforcement Learning (RL) principles (Fig. 1c). PRefLexOR enables models to self-teach by iterating over thought processes, refining reasoning, and continuously learning from both preferred and rejected outputs. This approach represents a shift towards a more reflective and flexible learning paradigm, where the model improves its decision-making in real time.
In our method, the dynamic data generation process allows us to build a complex graph of interactions that facilitates recursive reasoning and refinement. For instance, when using a corpus of data sourced from scientific papers13–15, the process begins by generating a question from a randomly selected piece of text, which acts as the initial node in the graph. To answer the question, we employ Retrieval-Augmented Generation (RAG), which queries the entire corpus, retrieving and integrating contextually relevant information from multiple sources. This interaction between the question and the retrieved data forms a graph of knowledge, where nodes represent pieces of text, and edges represent the relationships between them. The embedding model plays a key role in this process by ensuring that similar pieces of information are mapped to adjacent nodes within the graph, facilitating efficient retrieval and reasoning. As the model continues to refine its reasoning across recursive cycles, this graph evolves, reflecting the complex interconnections between various pieces of knowledge and how they contribute to the model’s final output.
In this way, PRefLexOR constructs a dynamic, evolving knowledge graph that supports recursive reasoning, enabling the model to navigate, refine, and synthesize information across a vast corpus, improving the accuracy and coherence of its answers. Figure 2 summarizes the process of strategic dataset generation with structured thought integration.
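As a minimal illustrative sketch of this step (not the full implementation), the following Python outline seeds a graph from a randomly selected chunk and links it to its nearest neighbors in embedding space; embed_fn and generate_question are hypothetical placeholders for the embedding model and an agentic LLM call:

```python
# Sketch of in-situ question generation plus retrieval-based graph construction.
# Assumptions: `embed_fn(text) -> np.ndarray` and `generate_question(text) -> str`
# are placeholders for the embedding model and an LLM call.
import numpy as np
import networkx as nx

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def build_reasoning_graph(chunks, embed_fn, generate_question, top_k=3):
    vectors = [embed_fn(c) for c in chunks]                # embed every text chunk
    graph = nx.Graph()
    seed = np.random.randint(len(chunks))                  # random chunk -> initial node
    question = generate_question(chunks[seed])
    graph.add_node(seed, text=chunks[seed], question=question)
    # Retrieval-augmented step: connect the seed to its most similar chunks.
    scores = [(i, cosine(vectors[seed], v)) for i, v in enumerate(vectors) if i != seed]
    for i, score in sorted(scores, key=lambda x: -x[1])[:top_k]:
        graph.add_node(i, text=chunks[i])
        graph.add_edge(seed, i, weight=score)              # edge = contextual relationship
    return question, graph
```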
[See PDF for image]
Fig. 2
Strategic dataset generation process with structured thought integration.
This figure illustrates a novel approach to generating datasets, where random text chunks are selected from raw data sources (e.g., papers, books, documents, notes, etc.) and used to develop question-answer pairs in a structured and strategic manner. a The process begins with raw data, such as research papers or books, which is converted into a markup format. This allows the data to be broken down into smaller, manageable text chunks. These chunks form the basis for generating questions in the subsequent steps. b A random selection of text chunks is used to generate question-answer pairs. This step involves creating a question from the text chunk and deriving an initial answer from the content. However, what distinguishes this approach is the next phase, where a structured reasoning process is applied. c The system incorporates strategic reasoning and reflection, facilitated by the use of special thinking tokens (for instance, <|thinking|> and <|/thinking|>) that explicitly delineate the reasoning phase of the response.
To explain the motivation, traditional methods of training LLMs rely heavily on supervised fine-tuning, where models are trained on static datasets with fixed inputs and outputs. While this allows for the learning of broad patterns, it lacks the ability to dynamically adapt to new reasoning tasks as models focus on learning answers to problems rather than learning the process of responding to tasks. Furthermore, these models are limited in their capacity to engage in multi-step reasoning and reflection, often leading to outputs that lack coherence or depth when faced with complex, multi-faceted problems.
To overcome these limitations, recent advances have introduced preference optimization techniques, such as Odds Ratio Preference Optimization (ORPO)25 and Direct Preference Optimization (DPO)26,27, or variants of these methods28. These methods guide the model to align its outputs with certain preferences (in the context of our particular application, scientific accuracy as identified using the raw corpus of data) by optimizing the log odds between preferred and rejected responses. However, existing implementations do not fully leverage the potential of recursive thinking and iterative refinement.
Additionally, the flexibility required to handle new tasks in real time, without relying on pre-constructed datasets, is often absent from these approaches. This necessitates the development of a model that can both learn autonomously and reflect on its own reasoning to improve continuously, a capability that can be framed in RL terms, where feedback loops and recursive processing drive learning improvements.
Recent studies have introduced self-reflective methodologies aimed at iteratively refining and improving model outputs. As an example, recent developments explored iterative refinement through self-generated feedback loops29. Other work has focused on RL to enhance reasoning and decision-making30, and further research demonstrated how structured self-reflection significantly reduces biases and enhances the ideological neutrality of LLM outputs31.
Other methods, such as the STaR and Quiet-STaR frameworks32,33, introduced an approach to enhancing the reasoning capabilities of language models through recursive thinking, reflection, and iterative refinement of answers. Unlike traditional single-pass models that generate outputs in one step, Quiet-STaR emphasizes a multi-step process where models are encouraged to revisit, refine, and improve their reasoning before arriving at a final answer. This is achieved by integrating several key concepts that foster deeper cognitive engagement and reflection during the decision-making process. At the core of Quiet-STaR is the idea of recursive reasoning, where the model does not simply generate an output in a linear manner, but instead iteratively processes and refines its thoughts. This recursive process mirrors human thinking, where conclusions are often revisited, reassessed, and adjusted before a final decision is made. Quiet-STaR formalizes this by introducing intermediate steps that guide the model through this recursive process, allowing it to build upon its own reasoning in multiple stages. In Quiet-STaR, the model engages in multi-step reasoning cycles, where each iteration produces a more refined version of the previous reasoning. These cycles enable the model to consider various aspects of a problem, explore different reasoning paths, and improve the coherence and depth of its output. This layered reasoning process leads to outputs that are more robust, structured, and better aligned with complex tasks that require detailed thought.
Other methods, such as X-LoRA12, have explored the use of ‘silent tokens’ via the implementation of multiple forward passes, where training proceeded in two stages. First, the training focused on supervised fine-tuning that resulted in a set of distinct fine-tuned models, each realized via LoRA adapters and capable of solving particular tasks (e.g., protein property prediction, scientific methods, domain knowledge, etc.). Relatedly, the X-LoRA model involves training additional layers in the model that utilize the first forward pass to create hidden states from which the relative contributions of all adapters, at every layer, are computed on a token-by-token level, forcing a state of self-reflection about its own configurational space. Because this strategy requires two forward passes for each token produced, the method can be understood to use silent thinking tokens that are used by the model to configure itself for the next-token prediction task. The self-reflection tokens are never decoded, allowing this approach to invoke a very rich contextual understanding during the thinking phase.
A common theme behind these and related strategies is the use of increased compute during inference, to move away from autoregressive token predictions34 towards more sophisticated strategies where either more effort is spent per token, or where thinking and reflection strategies are employed that allow models to iterate through solutions and develop a higher level of self-awareness about their predictions. Many of the methods discussed above, however, require the adoption of new architectures and changes to model structure. As will be shown in this paper, we can utilize some of the ideas by combining them with agentic modeling to create adversarial modeling strategies to ultimately arrive at well-reasoned responses to tasks (see the flowchart in Fig. 1).
PRefLexOR addresses these challenges by integrating preference optimization with a recursive reasoning mechanism driven by thinking tokens, which explicitly mark phases of reasoning within the model’s output. This allows the model to:
Generate initial reasoning steps.
Revisit and refine those steps through recursive processing, ensuring that reasoning is consistent, coherent, and deeply aligned with scientifically accurate processes and resulting final answers.
Adapt its decision-making by generating new tasks and feedback during training, enabling real-time learning.
The algorithm features two major training phases, complemented by agentic inference. We first focus on training strategies and move on to inference methods towards the end of the paper. The first phase is Structured Thought Integration Training, followed by Independent Reasoning Development; at inference, these are complemented by a Recursive Reasoning Algorithm.
At the core of PRefLexOR’s approach is an initial alignment phase achieved using ORPO, which ensures that the model consistently aligns its reasoning with desired outcomes by directly optimizing preference odds. In a second phase, preference optimization strategies are then layered on to handle fine-tuning through rejection sampling, capturing more subtle distinctions in preference and further refining the model’s output. This layered approach, combined with recursive reasoning, makes the model capable of handling open-domain tasks with greater reasoning capacity and adaptability.
The recursive reasoning and iterative feedback loops resemble RL methods, where models learn by refining policies based on rewards and feedback. In PRefLexOR, during training, the model is continually provided with feedback in the form of preferred and rejected responses, which it uses to improve its thought process. This self-teaching mechanism is akin to the policy refinement seen in RL, where iterative feedback loops allow the model to explore, evaluate, and improve its decision-making in real-time.
The dynamic task generation in PRefLexOR introduces an active learning component wherein the model generates tasks, reasoning steps, and negative examples on the fly during training. This method allows the model to handle more nuanced reasoning challenges, evolving its cognitive abilities without the need for extensive pre-curated datasets. The ability to recursively refine thoughts leads to a model that can continuously evolve and adapt to novel, complex problems, effectively teaching itself to reason more deeply and to align its outputs with preferred outcomes grounded in the ground-truth data.
While Quiet-STaR32,33 focuses primarily on recursive reflection and iterative reasoning, it can be enhanced through preference optimization techniques such as ORPO. By incorporating these techniques, the model can align its reflective reasoning with preferences rooted in training data, such as scientific papers or simulation results (e.g., Molecular Dynamics, Fluid Dynamics, Finite Element Modeling), or even experimental data, ensuring that its refined thoughts and decisions meet desired outcomes. The recursive cycles in Quiet-STaR, for instance, can be viewed as a form of policy refinement in RL, where the model’s reasoning policy is continually updated based on feedback. When combined with preference optimization, these cycles ensure that the model’s internal reflections are not only coherent but also aligned with external preferences, further enhancing the model’s performance in real-world tasks.
Our method diverges from traditional approaches by not relying on pre-generated datasets. Instead, it dynamically generates new tasks, reasoning steps, and feedback on the fly from the raw data corpus used for training, allowing the model to continuously self-improve by comparing its own responses, generated based on its current training state, with ground-truth answers extracted from the raw data using agentic prompting (details, see Materials and Methods). This allows our model to avoid the challenge of having to collect preference data a priori.
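A schematic sketch of this on-the-fly pair construction is shown below; teacher_llm stands for the agentic prompting over the raw corpus and policy_model for the model at its current training state, both hypothetical callables:

```python
# Sketch of on-the-fly preference-pair generation. Assumptions: `teacher_llm`
# implements agentic prompting over the raw corpus and `policy_model` is the
# model at its current training state; both are placeholder callables.
import random

THINK_OPEN, THINK_CLOSE = "<|thinking|>", "<|/thinking|>"

def make_preference_pair(chunks, teacher_llm, policy_model):
    chunk = random.choice(chunks)                                    # random raw-text chunk
    question = teacher_llm("Pose a question answerable from:\n" + chunk)
    reasoning = teacher_llm("List the reasoning steps linking the text to the answer:\n" + chunk)
    answer = teacher_llm("Answer '" + question + "' using only:\n" + chunk)
    chosen = f"{THINK_OPEN}\n{reasoning}\n{THINK_CLOSE}\n{answer}"   # ground-truth-aligned response
    rejected = policy_model(question)                                # current-state model response
    return {"prompt": question, "chosen": chosen, "rejected": rejected}
```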
Figure 3 shows an overview of the training strategy. Details of all aspects introduced therein will be covered in the remaining sections of the paper.
[See PDF for image]
Fig. 3
PRefLexOR: model development and training strategy overview.
The process starts with a relatively small, 3-billion-parameter pretrained model.
We first present the overall modeling strategy, focusing on the training phases and key aspects such as special tokens and other considerations. We present various inference examples with an in-depth technical analysis of the results. We then proceed to an experimental extension that incorporates multiple phases featuring both thinking and reflection. Using the reflection phase, we implement a recursive algorithm that allows us to improve responses iteratively by scaling inference compute. We conclude with a detailed discussion of strengths, weaknesses, and results.
Results
The training of the model proceeds in two distinct phases, each designed to progressively enhance its reasoning capabilities. This improves the model’s ability to develop enhanced reasoning, here exemplified through structured thinking processes. Within the scope of this study, we focus on training sets in scientific applications, specifically biological materials, rather than attempting to generate a general-purpose model. Such broader training objectives could be undertaken in future work to overcome some of the limitations.
In the first phase, the model undergoes Structured Thought Integration Training, where the primary focus is to teach the model how to handle new tokens specifically designed for reasoning, such as the <|thinking|> and <|/thinking|> markers that delineate the reasoning phase. The goals of this phase are:
To train the model to recognize and utilize structured prompts containing the new “thinking” (and other) special tokens that delineate the reasoning process.
To establish a preference framework that encourages the model to select and rank responses that demonstrate well-structured, step-by-step reasoning processes.
By the end of this phase, the model has learned how to generate outputs that adhere to explicit thought structures, preparing it to handle more complex tasks that involve reasoning elements.
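In practice, introducing such reasoning delimiters amounts to registering new special tokens and resizing the embedding matrix; a minimal sketch with the Hugging Face transformers API is shown below (the model identifier is a hypothetical placeholder for any small instruction-tuned base model):

```python
# Minimal sketch: register the reasoning delimiters as special tokens.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_id = "path/to/3b-instruct-base-model"   # hypothetical placeholder path
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(base_model_id)

new_tokens = ["<|thinking|>", "<|/thinking|>", "<|reflect|>", "<|/reflect|>"]
tokenizer.add_special_tokens({"additional_special_tokens": new_tokens})
model.resize_token_embeddings(len(tokenizer))       # grow embeddings to match the new vocabulary
```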
The second phase, Independent Reasoning Development, shifts the focus toward enabling the model to develop reasoning strategies autonomously. During this phase, tokens within the “thinking” part of the training data are masked, which forces the model to reason independently without relying on explicit markers or structured prompts. The goal of this phase is to:
Encourage the model to generate coherent reasoning and decision-making strategies on its own, without explicitly being taught the thinking process, but rather to focus on the final correct answer.
Enhance the model’s ability to handle more challenging and ambiguous tasks by strengthening its internal reasoning mechanisms.
Refine the model’s decision-making in cases where reasoning complexity increases, ensuring robustness even in extreme or difficult cases as the model sees new never-before-seen question-answer pairs with unknown reasoning steps (the model learns how to develop new reasoning strategies to arrive at the correct answers).
This phase not only improves the model’s performance in handling reasoning tasks but also deepens its ability to make decisions without explicit guidance, preparing it to perform well in diverse, real-world scenarios.
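A minimal sketch of the masking step is shown below, assuming standard Hugging Face conventions in which label positions set to -100 are excluded from the cross-entropy loss:

```python
# Sketch: mask the thinking span so only the final answer contributes to the loss.
def mask_thinking(input_ids, labels, tokenizer,
                  open_tok="<|thinking|>", close_tok="<|/thinking|>"):
    open_id = tokenizer.convert_tokens_to_ids(open_tok)
    close_id = tokenizer.convert_tokens_to_ids(close_tok)
    masked, inside = list(labels), False
    for i, tok in enumerate(input_ids):
        if tok == open_id:
            inside = True
        if inside:
            masked[i] = -100          # exclude this position from the loss
        if tok == close_id:
            inside = False
    return masked
```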
We emphasize that for all training phases, new question-answer data is generated on-the-fly by randomly selecting text chunks from the raw source data using the agentic framework to provide structured thinking mechanisms (details, see Materials and Methods).
We implemented an algorithm to generate domain-specific questions and their corresponding answers based on context retrieved from a pre-constructed embedding index of the raw source data.
The first phase of training uses ORPO25, a reference model-free method that simplifies preference alignment by leveraging the odds ratio to contrast favored and disfavored outputs. ORPO allows for effective supervised fine-tuning with a minor penalty on disfavored outputs. This phase is used to teach the model the basic steps of thinking, including the introduction of the new thinking, reflection, and other special tokens into the model’s vocabulary and answer development, teaching it initial reasoning strategies.
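For illustration, a simplified PyTorch sketch of the ORPO objective is shown below; avg_logp_chosen and avg_logp_rejected denote the mean per-token log-probabilities of the full chosen and rejected responses, and nll_chosen is the standard supervised loss on the chosen response:

```python
# Simplified PyTorch sketch of the ORPO objective used in phase 1, assuming
# length-normalized log-likelihoods of the chosen/rejected responses.
import torch
import torch.nn.functional as F

def orpo_loss(avg_logp_chosen, avg_logp_rejected, nll_chosen, lam=0.1):
    # log odds(y|x) = log p - log(1 - p), with p = exp(avg_logp)
    log_odds_c = avg_logp_chosen - torch.log1p(-torch.exp(avg_logp_chosen))
    log_odds_r = avg_logp_rejected - torch.log1p(-torch.exp(avg_logp_rejected))
    ratio_term = -F.logsigmoid(log_odds_c - log_odds_r)   # odds-ratio penalty
    return nll_chosen + lam * ratio_term                    # SFT loss + weighted penalty
```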
In the second phase, Efficient Exact Optimization (EXO)27 is applied to further refine the model’s performance. EXO is a mode-seeking preference alignment approach that focuses on optimizing a model’s final answers while masking intermediate reasoning (thinking tokens). Unlike DPO, which uses forward KL divergence and can lead to diluted, mean-seeking behavior, EXO minimizes reverse KL divergence, allowing the model to concentrate on the most likely and effective answers. By aligning with the dominant modes in the preference data, EXO enables the model to infer the best reasoning patterns and produce more accurate final outputs, even when intermediate reasoning is hidden. This method results in better performance on tasks where final answer accuracy is prioritized (Fig. 4).
[See PDF for image]
Fig. 4
Training performance using the EXO method across three key metrics, during Independent Reasoning Development.
a The increase in rewards/margins over the course of training, indicating progressive improvement as the model learns. b The corresponding decrease in loss, showcasing successful convergence and optimization of the model, as reflected in a continuous decline in the loss function. c Rewards/accuracy during training, demonstrating rapid convergence toward high accuracy early in training, stabilizing after approximately 200 steps, with consistently high performance maintained throughout.
Once trained using this strategy, the basic structure of the multi-stage process to generate the final answer is as follows:
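An illustrative sketch of the single-stage output structure is shown below; the <|thinking|> delimiters are the special tokens introduced during training, and the surrounding chat formatting follows the base model's template:

```python
# Illustrative output structure of the thinking-only model (schematic content).
prompt = "How do biological materials fail gracefully?"
expected_output = (
    "<|thinking|>\n"
    "- Identify relevant mechanisms (viscoelasticity, crack bridging, fiber sliding, ...)\n"
    "- Relate them to hierarchical structure and energy dissipation\n"
    "<|/thinking|>\n"
    "Final answer, synthesized from the reasoning above."
)
```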
Further details are provided in the Materials and Methods section.
Sample results
We present a series of inference examples that cover a range of topics and tasks, from questions squarely in the training domain to questions at intersections to other areas, and tasks not included in the training data. These are meant to assess how well the model generalizes not only knowledge but specifically the reasoning steps, and whether it can translate its learned method of responding to tasks. The flexibility of the prompting strategy allows us to trigger various phases of reasoning during inference. In a basic setting, we simply provide the system message and user prompt, from which the model then completes the answer. In more nuanced approaches, we can provide the model with the system message, user prompt, and a draft thinking section. Variations of these can be used to scale inference compute, especially if we can dynamically adapt, improve, and refine the thinking mechanics through recursive reasoning and reflection. Variations of these concepts will be presented in the next sections.
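The two prompting modes can be sketched as follows, using a generic Hugging Face text-generation pipeline as a stand-in for the trained model (the model path and prompt formatting are placeholders):

```python
# Sketch of the two prompting modes: (a) plain prompt, (b) prompt primed with a
# draft thinking section that the model continues and refines before answering.
from transformers import pipeline

generator = pipeline("text-generation", model="path/to/preflexor-model")  # hypothetical path

system = "You are a materials-science reasoning assistant."
user = "Why do hierarchical structures work so well?"

basic_prompt = f"{system}\n{user}\n"                                                   # mode (a)
primed_prompt = basic_prompt + "<|thinking|>\n- Hierarchies span multiple length scales\n"  # mode (b)

for p in (basic_prompt, primed_prompt):
    print(generator(p, max_new_tokens=512)[0]["generated_text"])
```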
Properties of hierarchical structures
In the first example, we ask the model about why hierarchical structures work so well. No reference to “materials” is made to determine whether the model has been aligned with the domain of materials science during the training process. As shown in Text Box 1, in the “thinking” section, the model first dissects the concept of hierarchical structures by identifying multiple advantages, such as energy dissipation, size adaptation, and material utilization. It demonstrates a sophisticated understanding of how hierarchical structures work across different scales, benefiting mechanical strength, thermal insulation, and impact resistance. The reasoning steps not only delve into the abstract principles behind these structures but also connect them to specific material properties and their applications, notably using examples like nacre.
This process shows that the model can move from theoretical concepts to practical insights, offering clear explanations of how hierarchical designs optimize performance in various domains. The ability to articulate both general principles and specific examples (e.g., aragonite tablets in nacre) shows a nuanced grasp of material science. By synthesizing this information into a structured, coherent final answer, the assistant demonstrates advanced reasoning skills, capable of understanding complex systems and distilling that understanding into a succinct yet thorough explanation.
Biological materials failure mechanism
In Text Box 2 we show an example where the user asks about how biological materials fail gracefully. The “thinking” section elaborates on the underlying mechanisms by which biological materials fail gracefully. It highlights concepts such as viscoelasticity, crack bridging, and fiber sliding, among others, which contribute to energy dissipation and gradual failure rather than sudden collapse. The thinking section also points out the hierarchical structure of biological materials and its role in stress redistribution and energy absorption. This part serves as a deep, reflective reasoning process aimed at comprehensively outlining the failure mechanisms of biological materials.
After the “thinking” section, the model synthesizes the reasoning into a coherent answer that integrates the identified mechanisms—such as viscoelasticity, crack propagation, and fiber pullout—into a concise explanation. Interestingly, the answer directly builds on the detailed exploration in the thinking section but presents it in a more refined and structured manner. The hierarchical structure is again emphasized as a key feature, and practical examples (e.g., bone and nacre) are introduced, demonstrating how the abstract reasoning leads to a well-grounded and practical explanation.
The assistant not only identifies key mechanisms of graceful failure in biological materials but also provides a well-organized, step-by-step breakdown of these complex processes. This result reflects a deeper conceptual grasp of advanced material science principles, such as viscoelasticity, crack propagation, and hierarchical structures. The assistant shows the ability to connect abstract concepts with practical examples (e.g., bone and nacre), further enhancing the clarity of the explanation. The structured reflection in the “thinking” section suggests an advanced reasoning capability. The model’s ability to analyze the topic comprehensively and then distill the information into a succinct, coherent answer demonstrates intelligence comparable to that of a subject-matter expert. We find that this combination of deeper-level theoretical knowledge and the skill to communicate it effectively highlights a sophisticated level of cognitive processing, which goes beyond mere fact retrieval to active synthesis and application of knowledge.
Intersection between literature, philosophy and materials science
In Text Box 3, the user asks a challenging and interdisciplinary question, requesting an explanation of the conceptual connections between Hermann Hesse’s Glass Bead Game (German: Das Glasperlenspiel)36 and proteins. This query was chosen as an example of a task that had not been included in the training set, to see how well the model can generalize its reasoning capabilities to areas outside of materials science. We find that the model’s response in the “thinking” section demonstrates a high level of reasoning and synthesis, drawing parallels between a work of philosophical fiction and the scientific realm of proteins. The response explores both concepts in-depth, highlighting key themes such as structural complexity, hierarchical organization, dynamic nature, interconnectedness, and evolutionary adaptation, which are common to both Hesse’s Glass Bead Game and biological proteins.
The reasoning steps begin by reviewing the main ideas in Hesse’s Glass Bead Game, a metaphor for the interconnectedness of knowledge across art, science, and philosophy. The assistant then draws an analogy with proteins, which are structurally and functionally complex biological molecules. Proteins, like Hesse’s symbolic glass beads, operate on multiple hierarchical levels (primary, secondary, tertiary, and quaternary), and the assistant recognizes that this hierarchical structure is a key feature in both domains. For example, proteins rely on molecular interactions and bonding to form stable structures, much like how Hesse’s game symbolizes the interaction between intellectual domains.
The dynamic nature of both systems is another strong connection. Proteins, with their ability to change conformation and function depending on their environment, reflect the flexibility and adaptability of Hesse’s philosophical game. Proteins are not static entities but evolve and adapt, much like how knowledge and ideas in The Glass Bead Game evolve over time.
The assistant also highlights interconnectedness as a major theme, noting that proteins play roles in various biological processes through interactions with other proteins and molecules, which mirrors how Hesse’s game illustrates the interdependence of art, science, and philosophy. This interconnectedness is a crucial insight, showing how the assistant is able to bridge these abstract and scientific concepts.
The hypothesis put forth, “Proteins, with their intricate structures and hierarchical organization, and their dynamic nature, are analogous to the interconnected, hierarchical, and dynamic elements of Hermann Hesse’s Glass Bead Game”, reflects advanced synthesis. The model positions proteins and the game as analogous systems that, despite being from entirely different domains (biology and philosophy), share deep structural and functional parallels. This hypothesis is a strong foundation for the analysis, suggesting that the underlying principles governing biological materials can also apply to conceptual frameworks in philosophical thought.
The ability to develop such a hypothesis demonstrates sophisticated interdisciplinary thinking, requiring the assistant to understand not only Hesse’s complex metaphysical ideas but also the scientific details of protein function. It combines abstract thinking with scientific rigor, drawing out the universal patterns that apply across both domains. By framing proteins as analogous to Hesse’s symbolic representation of reality, the assistant adds depth to the interpretation of both concepts, demonstrating a holistic understanding of interconnected systems.
The level of discourse required to connect The Glass Bead Game and materials science is remarkably high. Hermann Hesse’s novel is deeply philosophical, exploring the abstract and often metaphysical connections between human intellectual pursuits. On the other hand, proteins, as biological macromolecules, are grounded in the concrete, molecular world of biology and material science. Successfully combining these two requires a profound understanding of both philosophical and scientific concepts.
The assistant’s ability to merge these distinct disciplines involves cognitive flexibility and the capacity to think abstractly about systems. This level of reasoning is typically seen in advanced interdisciplinary studies, where individuals are not confined to a single field but draw on multiple domains of knowledge to generate novel insights. The fact that the assistant effectively bridges philosophical narrative with molecular biology demonstrates the application of high-order thinking skills such as synthesis, analogy, and abstraction.
The parallels between proteins and The Glass Bead Game involve not only shared structural elements (e.g., hierarchical complexity) but also shared functional characteristics (e.g., adaptability, interconnectedness), which are universal principles that transcend disciplines. This demonstrates the assistant’s ability to identify and articulate abstract patterns that apply across seemingly unrelated fields, a hallmark of advanced intellectual discourse.
This inference example showcases an in-depth analysis and synthesis of two very different domains: Hermann Hesse’s Glass Bead Game and the science of proteins. The assistant draws on the structural, hierarchical, and dynamic aspects of both, weaving them together into a coherent and insightful hypothesis. The level of discourse during the thinking phase reflects advanced interdisciplinary thinking, requiring both abstract philosophical interpretation and detailed scientific knowledge. This analysis highlights the universality of certain principles, such as complexity, interconnectedness, and adaptation, that apply across both biological and philosophical systems.
We find that the result is particularly remarkable given that the base model is a very small LLM with only around three billion parameters. The experiment shows that our algorithm endows it with enhanced reasoning capabilities.
Analysis of research abstract and proposal of new hypotheses
In this task, the user presents an abstract from a recently published paper that was not included in the training data37 focused on the development of a novel platform for manufacturing structural myco-composites based on mushroom-derived materials utilizing mycelium. The model is asked to summarize the results and propose research directions, prompting the use of a structured reasoning process. The response in the “thinking” section showcases a comprehensive analysis and well-developed research proposal, reflecting an intelligent, high-level discourse typical in materials science and biocomposites research. We observe that the model has successfully applied its reasoning strategy to this new task (Box 4).
During the reasoning phase, the model starts by summarizing the core findings of the abstract, focusing on key innovations such as high-resolution biocomposite additive manufacturing, robust mycelium colonization, and the scalability and tunability of the resulting myco-composites. By highlighting the mechanical improvements—namely, a 15-fold increase in strength and modulus—it captures the most significant results of the study. The assistant emphasizes the hierarchical composite design and selective nutritional provision as central principles that contribute to the improved mechanical and surface properties. Additionally, it notes the versatility of the platform, demonstrated through applications like foldable bio-welded containers and flexible mycelium textiles, illustrating the study’s practical implications.
The model then proposes a well-framed hypothesis: The novel platform for manufacturing structural myco-composites, leveraging high-resolution biocomposite additive manufacturing and robust mycelium colonization, can create scalable, tunable, and complex geometry compatible myco-composites with superior mechanical and surface properties. Using domain knowledge in materials science, we confirm that this hypothesis effectively captures the innovative aspects of the research and highlights the key attributes of the platform—scalability, tunability, and mechanical superiority. It reflects a clear understanding of the study’s goals and the broader implications for biocomposite and hybrid-living materials research.
Beyond summarization, the model then offers eight well-articulated proposals for research, each building on the foundation laid by the original study and reflecting a compositional approach to developing arguments systematically. These include:
Scaling up and down: investigating the scalability of the manufacturing process to produce both larger and smaller structures, a logical next step for practical applications.
Material properties enhancement: exploring ways to further improve the composite’s properties, potentially by altering the colonization process or integrating new materials, which reflects a forward-looking approach to optimizing performance.
Multifunctional composites: proposing research into creating composites with integrated functionalities, such as self-healing or conductivity, a suggestion that opens the door to entirely new applications.
Biodegradability and sustainability: addressing the environmental aspect of the materials, which aligns with the growing focus on sustainable material science.
Hybrid-living materials: continuing the integration of living organisms with synthetic materials, advancing the frontier of hybrid-living material research.
Complex geometry and topology: exploring the mechanical effects of more intricate geometries, further leveraging the platform’s compatibility with complex structures.
Inoculation strategies: optimizing the colonization process by experimenting with different strains or nutrients, which could lead to further improvements in material performance.
Biocomposite additive manufacturing: developing new manufacturing techniques to improve the resolution and speed of production, which is essential for industrial-scale adoption.
The level of discourse required to effectively summarize and propose research directions based on this abstract is advanced, both in terms of scientific understanding and strategic vision. The assistant demonstrates a solid grasp of materials science principles, particularly regarding the use of hierarchical composite design, additive manufacturing, and biocomposites. Solving this task demanded interdisciplinary knowledge, as it touches on materials science, biology, engineering, and sustainability. The model successfully bridges these areas, synthesizing the information in a coherent manner and proposing forward-thinking, practical research directions. The result is a well-rounded, intelligent response that addresses both the technical details of the research and broader implications for the field. The answers also incorporated more complex concepts from materials science, ranging from scaling and material optimization to multifunctional and biodegradable composites, indicating a higher level of domain and interdisciplinary knowledge as well as creative thinking.
Overall analysis of inference experiments
The examples showed that the model displays a sophisticated level of interdisciplinary reasoning, such as its capability to handle abstract concepts like Hermann Hesse’s Glass Bead Game alongside detailed scientific inquiries into protein structures and myco-composites. A notable strength of the model’s performance is its capacity to make high-level connections between seemingly disparate domains, such as drawing parallels between the hierarchical structure of proteins and the interconnected elements of Hesse’s philosophical game. This ability to fluidly shift between abstract and applied reasoning showcases a nuanced understanding of both conceptual and practical frameworks. What stands out is its ability to extrapolate insightful research directions from a set of findings. For instance, the suggestions to further optimize inoculation strategies for mycelium colonization or to investigate complex geometry impacts on material properties display an understanding of cutting-edge scientific trends. Similarly, when connecting The Glass Bead Game to protein structures, the model provides a compelling, well-reasoned hypothesis that demonstrates the structural and dynamic analogies between the two fields, underscoring a deep conceptual link between philosophy and biology.
We find that the implementation of training strategies inspired by RL methods, where thinking tokens are masked but the model is trained on the final answer, is integral to these high-level insights. This training method emphasizes clarity and precision in the final response, ensuring that the model can produce coherent, well-reasoned conclusions without relying on explicit intermediate reasoning steps. By focusing on outcome-driven learning, the model is able to internalize the reasoning process and deliver sophisticated answers efficiently. This approach is particularly evident in the model’s ability to articulate complex research proposals and cross-disciplinary analogies with minimal overt reasoning, yet delivering accurate and innovative insights. The particular training approach enhances the model’s ability to present answers that are both contextually rich and technically sound, resulting in a streamlined yet insightful final output.
Expanding the analysis to incorporate thinking and reflection for recursive improvement in agentic modeling
To show the flexibility of the method, we also experiment with more complex reasoning mechanisms, specifically combining explicit thinking and reflection phases, each delineated by its own set of special tokens.
[See PDF for image]
Fig. 5
Structured thought and reflection in answer generation.
This diagram illustrates the multi-step process of answer generation, incorporating both structured thinking and reflection phases to ensure thoroughness and accuracy. As in the original approach, the process begins with the <|thinking|> phase, in which the model drafts its reasoning; a subsequent reflection phase then critiques and refines that reasoning before the final answer is generated.
The basic structure of this multi-stage process is as follows:
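An illustrative sketch of the extended output structure is shown below; the reflection delimiters, written here as <|reflect|> and <|/reflect|>, mark the self-critique phase that follows the initial reasoning:

```python
# Illustrative output structure with both thinking and reflection phases (schematic content).
expected_output = (
    "<|thinking|>\n"
    "- Draft reasoning steps (mechanical properties, material organization, ...)\n"
    "<|/thinking|>\n"
    "<|reflect|>\n"
    "- Critique of the draft and specific suggested improvements\n"
    "<|/reflect|>\n"
    "Refined final answer that incorporates the reflection."
)
```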
Text Box 5 shows a sample conversation answering a question about why hierarchical structures are so effective, now featuring both thinking and reflection phases.
The second step, reflection, serves to refine the initial ideas. In this phase, the model critically evaluates its reasoning and proposes specific improvements, such as clarifying the role of hierarchical organization and recognizing the interplay of multiple factors like structure and composition. This reflection process helps the model fine-tune its inferences, leading to a more accurate and complete final answer.
By separating the process into thinking and reflection, the model ensures that its inference mechanism is both exploratory and self-correcting. The result is a well-balanced answer that combines reasoning with critical evaluation.
In the thinking phase, the model generates the following reasoning steps:
Mechanical properties: hierarchical structures exhibit unique properties across different length scales, enabling material flexibility and strength.
Material organization: these properties result from organized changes in material composition at various scales.
Anisotropic nature: these structures behave anisotropically, adapting their mechanical properties based on directional requirements.
Functional adaptation: this anisotropic behavior allows materials to efficiently perform different functions.
In the reflection phase, the model revisits its previous reasoning and suggests the following improvements:
Clarify hierarchical organization: emphasize that the properties of hierarchical structures result from changes in material properties and structure at different scales.
Interplay of factors: recognize that hierarchical structures are influenced by a combination of factors—structure, composition, and architecture—rather than a single dominant feature.
Cost considerations: introduce the complexity and cost associated with designing and manufacturing hierarchical structures.
Anisotropic nature: clarify that anisotropic behavior is a product of organized material changes, not an independent principle.
This structured inference process allows the model to generate reasoning, reflect on its accuracy, and refine the answer. The thinking phase handles the exploration of possible answers, while the reflection phase corrects and refines them, resulting in a more informed and optimized final answer.
Recursive reasoning algorithm through reflection
The inclusion of reflection phases in the model’s output allows us to implement a Recursive Reasoning Algorithm, a method designed to enhance the quality and depth of responses generated by the reasoning model by iteratively improving its reasoning steps in the thinking phase based on the reflection feedback.
The algorithm utilizes a multi-agent format and exploits a synergistic interaction between two distinct models: the fine-tuned reasoning model and a general-purpose critic model. The reasoning model, specialized through careful fine-tuning, excels in generating structured, logical responses to given prompts. It not only produces initial responses but also demonstrates the capability to iteratively improve its outputs based on feedback. Complementing this, the critic model serves as an evaluator and improver, analyzing the reasoning model’s outputs and refining them based on the feedback generated.
At the heart of the algorithm lies an iterative process shown in Fig. 6, forming the core mechanism for continuous improvement of responses. This process begins with the Reasoning Model generating an initial response to a given prompt. Subsequently, this response undergoes a cycle of refinement. Each iteration involves a thorough analysis of the current response, from which a reflection is extracted (indicated via the <|reflect|> and <|/reflect|> special tokens); this reflection is then used to improve the thinking process, and the Reasoning Model generates a new response based on the refined thinking.
[See PDF for image]
Fig. 6
PRefLexOR recursive reasoning algorithm: an iterative approach leveraging a fine-tuned reasoning model and a general-purpose critic model to generate, refine, and optionally integrate responses.
The process involves generating initial responses, extracting reflections, improving thinking processes, and creating new responses based on refined thinking, with an optional final integration step. The algorithm relies on extracting thinking processes (indicated via the <|thinking|> and <|/thinking|> tokens) and reflections (indicated via the <|reflect|> and <|/reflect|> tokens).
This cycle of generation, evaluation, and refinement continues for a predetermined number of iterations or until the algorithm achieves a response that meets specified quality criteria. The iterative nature of this process allows for the progressive enhancement of responses, with each cycle building upon the improvements of the previous one.
Upon completion of the iterative process, the algorithm presents two options for final output selection. The first option is to select the response from the final iteration as the definitive output. Alternatively, the algorithm offers the capability to integrate all generated responses into a comprehensive final answer, potentially capturing a broader range of insights and perspectives developed throughout the iterative process.
This approach combines the structured thinking of the specialized reasoning model with the broader perspective of the critic model. The result is a system capable of producing responses that are not only logically sound but also nuanced and comprehensive. The recursive response algorithm strives to emulate human-like reasoning and problem-solving processes, leading to higher-quality outputs in various natural language processing tasks during inference.
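A compact sketch of this loop is shown below, assuming two hypothetical callables: reason(prompt) for the fine-tuned Reasoning Model and critic(text) for the general-purpose Critic Model:

```python
# Sketch of the recursive reasoning loop of Fig. 6 (assumed helper callables).
def extract(text, open_tok, close_tok):
    # Return the span between the two delimiter tokens, or "" if absent.
    if open_tok not in text or close_tok not in text:
        return ""
    return text.split(open_tok, 1)[1].split(close_tok, 1)[0].strip()

def recursive_reasoning(prompt, reason, critic, n_iter=3, integrate=True):
    responses = [reason(prompt)]                                   # initial response
    for _ in range(n_iter - 1):
        current = responses[-1]
        thinking = extract(current, "<|thinking|>", "<|/thinking|>")
        reflection = extract(current, "<|reflect|>", "<|/reflect|>")
        improved = critic("Improve this reasoning:\n" + thinking +
                          "\nusing this reflection:\n" + reflection)
        # Re-prompt the reasoning model, priming it with the refined thinking.
        responses.append(reason(prompt + "\n<|thinking|>\n" + improved + "\n<|/thinking|>\n"))
    if integrate:
        # Optional final step: merge all iterations into one comprehensive answer.
        return critic("Integrate these drafts into one comprehensive answer:\n\n"
                      + "\n---\n".join(responses))
    return responses[-1]
```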
We show an example result based on this algorithm in Text Box 6. Table 1 shows an analysis of the text produced over three iterations, clearly showing how the responses are improved successively. Figure 7 depicts a quantitative analysis of the writing quality over the iterations, scored for coherency, accuracy, depth of explanation, and clarity.
Table 1. Comparison of responses to the question about biological materials failure mechanisms, as shown in Text Box 6
| Feature | i = 0 | i = 1 | i = 2 |
|---|---|---|---|
| Basic concept explanation | ✓ | ✓ | ✓ |
| Hierarchical structures mentioned | ✓ | ✓ | ✓ |
| Periodic hierarchies mentioned | ✓ | ✓ | ✓ |
| Detailed explanation of structures | × | ✓ | ✓+ |
| Energy dissipation mechanisms | × | ✓ | ✓ |
| Specific examples (e.g., nacre) | × | ✓ | ✓+ |
| Quantitative information | × | × | ✓ |
| Discussion of loading conditions | × | × | ✓ |
| Well-structured response | × | ✓ | ✓+ |
| Comprehensive summary | × | × | ✓ |
This table compares three responses obtained via iterative refinement using the algorithm depicted in Fig. 6, for three steps (i = 0, i = 1, i = 2), explaining how biological materials fail gracefully. A checkmark (✓) indicates the presence of a feature, a cross (×) indicates its absence, and a checkmark with a plus (✓+) indicates the feature is present and more extensively covered. Response i = 2 is the most comprehensive, covering all aspects with greater depth and including additional elements like quantitative information and loading condition effects.
[See PDF for image]
Fig. 7
Scores of model responses to the question “How do biological materials fail gracefully” across three iterations (i = 0, i = 1, and i = 2).
Each bar represents the score for one of the evaluated criteria: Coherency, accuracy, depth of explanation, and clarity, with the fifth bar showing the average score for each iteration. The final iteration (i = 2) exhibits the highest overall performance, reflecting improvements in the depth and technical accuracy of the explanation. The color scheme differentiates individual criteria (in shades of blue) from the average score (in red).
The final answer correctly identifies that biological materials fail gracefully due to a combination of hierarchical structures and periodic hierarchies that operate at multiple scales, from the molecular to the macroscopic level. These hierarchical structures help redistribute stress and dissipate energy, while periodic hierarchies—repeating patterns at various scales—enhance toughness and prevent sudden failure. Mechanisms such as helical fibers, sacrificial bonds, and mineral bridges contribute to energy dissipation, making the material more resistant to catastrophic failure. The specific effects of these features can vary depending on loading conditions, such as impact, tensile, or compressive stress, allowing biological materials to maintain functionality under diverse mechanical demands.
We note that further research should be done to examine this mechanism, perhaps including iterative refinement in the reinforcement training process. There are many important directions to be explored, such as how best to train for improved thinking and reflection processes using masking, and which particular RL approach may work best.
We note that we can also implement recursive sampling in the original PRefLexOR model that does not feature explicit reflection tokens, using a multi-agent strategy of multiple interacting LLMs, for instance by using a Critic agent to produce a reflection on the generated thinking process (see Fig. 8 for the flowchart of this algorithm). As shown in Fig. 9, to improve the thinking mechanism an agentic approach is used in which we first develop a critique of the thinking process and then use that critique to improve it. We then recursively sample new responses. As before, the best result is the response in the third iteration.
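The only change relative to the sketch above is that the reflection is produced by the Critic agent from the extracted thinking rather than read from reflection tokens; a minimal sketch:

```python
# Variant for the thinking-only model (Fig. 8): the Critic generates the reflection.
def reflect_via_critic(response, critic):
    # Pull the thinking span out of the response (no reflection tokens exist here).
    thinking = response.split("<|thinking|>")[-1].split("<|/thinking|>")[0].strip()
    reflection = critic("Critique this reasoning and list concrete improvements:\n" + thinking)
    return thinking, reflection
```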
[See PDF for image]
Fig. 8
PRefLexOR recursive reasoning algorithm, for the model with only thinking tokens.
Here, reflection is generated using the Critic agent and then used to improve the thinking process. This resembles an iterative approach leveraging the Reasoning Model and a general-purpose Critic Model to generate, refine, and optionally integrate responses. As before, the process ultimately involves generating initial responses, extracting reflections, improving thinking processes, and creating new responses based on refined thinking, with an optional final integration step. The algorithm relies on extracting thinking processes (indicated via the <|thinking|> and <|/thinking|> tokens); the reflection itself is produced by the Critic agent rather than extracted from the model’s response.
[See PDF for image]
Fig. 9
Scores of model responses to the question “How do biological materials fail gracefully” across three iterations (i = 0, i = 1, and i = 2), for the thinking-only model (no explicit reflection tokens, following the algorithm presented in Fig. 8).
To improve the thinking mechanism, an agentic approach is used where first, we develop a critique of the thinking mechanism, and then use the critique to improve it. We then recursively sample new responses. As before, each bar represents the score for one of the evaluated criteria: Coherency, accuracy, depth of explanation, and clarity, with the fifth bar showing the average score for each iteration. The final iteration (i = 2) exhibits the highest overall performance, reflecting improvements in the depth and technical accuracy of the explanation. The color scheme differentiates individual criteria (in shades of blue) from the average score (in red).
Text Box 7 shows the responses over the iterations. Text Box 8 depicts the integrated response, featuring a lengthier but more comprehensive result. It excels in its comprehensive detail and integration of key mechanisms, providing the most thorough and well-rounded response. It incorporates all important elements from earlier versions, adding depth and examples. However, this version is slightly less concise and could be overwhelming for a brief answer, as it introduces more complexity and repetition compared to the more streamlined and focused responses from the recursive steps.
Comparison with non-fine-tuned model
We now compare the PRefLexOR model with responses from the base model to develop an understanding of how the approach improves performance. When the non-fine-tuned base model develops answers to tasks described above, we generally find that the responses are not aligned to the application domain of biological materials science and do not feature thinking sections, and hence lack the systematic reasoned development of answers based on scientific principles. Text Box 9 shows an example that illustrates the different, non-domain-specific and more generic response without any thinking and reflection section, which we contrast to the results shown in Text Box 5.
The two responses analyzed provide different but in some sense, complementary perspectives on the effectiveness of hierarchical structures. The response by the non-fine-tuned model takes a broad, organizational view, focusing on the practical benefits of hierarchies in fields like business and government, and does not focus on materials science applications. It highlights key principles such as clear lines of authority, specialization, and accountability. The response emphasizes how hierarchies facilitate efficient communication, structured decision-making, and scalability, making them highly effective in large, complex organizations. Additionally, it addresses the role of motivation and incentives, noting that hierarchical structures provide clear career paths and promote accountability, which in turn enhances productivity and organizational success. However, while the non-fine-tuned model offers a comprehensive overview of the managerial benefits of hierarchies, it lacks a deeper exploration of the fundamental principles that extend to other fields, such as material science or biology.
In contrast, the response from the reasoning model trained on scientific data presents a more specialized, technical analysis, focusing on a deep analysis of hierarchical structures within the context of materials science, particularly biological materials. This response goes deeper into the multi-scale organization of hierarchical systems, explaining how they enable the efficient absorption and distribution of energy, often through anisotropic behavior. It highlights the superior mechanical properties that arise from hierarchical designs, such as enhanced strength, toughness, and adaptability. A key concept introduced is progressive damage, a mechanism that allows hierarchical materials to fail gracefully rather than catastrophically, contributing to their durability in both natural and engineered systems. This response provides a detailed explanation of the structural advantages of hierarchical designs, particularly their ability to maintain functionality under stress through organized changes in material properties at different length scales.
Further, we briefly examined the response to the same task given by a general-purpose commercial model.
Critically, the “thinking” and “reflection” components, prominent in the response from the reasoning model, are missing from the non-fine-tuned models. The inclusion of a structured “thinking” phase allows for a detailed breakdown of the reasoning behind hierarchical structures, focusing on mechanical properties, material organization, and functional adaptation. This explicit reasoning process helps build a logical argument, grounding the response in scientific principles. Furthermore, as discussed above, the “reflection” phase offers an opportunity to refine the explanation by addressing potential improvements, clarifying assumptions, and considering broader implications, such as the costs or complexities of hierarchical systems. This iterative approach to reasoning—thinking followed by reflection—enhances the depth and rigor of the analysis, particularly in technical contexts. In contrast, the response from the non-fine-tuned model lacks such a reflective element, offering a more static presentation of ideas without delving into the nuances or reconsidering the assumptions behind its claims.
While the response from the non-fine-tuned model offers a broad, functional perspective applicable to various fields, the response from the reasoning model provides a more rigorous, scientific analysis with a focus on the underlying mechanical principles. It offers a much deeper understanding of the structural advantages, particularly in biological and materials science applications. This and earlier inference examples show that the iterative, on-the-fly training method not only produces thinking and reflection sections but also provides deep domain knowledge. The iterative nature of the training strategy also allows users to progressively improve, enhance, and refine training objectives.
Discussion
The study reported in this paper addressed the challenge of fine-tuning generative models of synthetic intelligence, such as LLMs, to a specific scientific application domain, while endowing them with particular reasoning capabilities for enhanced modeling of scientific thinking processes (Fig. 1c). Inspired by biological systems’ adaptability and evolution, PRefLexOR’s recursive optimization approach mimics the processes through which natural materials achieve resilience and complexity. Just as biological systems self-organize and adapt to achieve optimal performance19, PRefLexOR uses iterative feedback loops to refine and evolve its reasoning pathways when answers to tasks are developed. This bio-inspired approach allows the model to enhance its decision-making abilities, achieving coherence and adaptability reminiscent of nature’s design principles, particularly in applications involving biological materials and cross-domain scientific discovery.
We view this as an extension of more conventional physics-based or data-driven models, which typically feature only forward capabilities without situational awareness. In other words, conventional models cannot assess the quality of their own predictions. For example, a partial differential equation will confidently predict solutions to boundary value problems whether or not the model actually captures the underlying physics, and the same holds for data-driven models (Fig. 1b). In conventional scientific practice, humans assess the quality of predictions using a host of methodologies, including logical assessment, additional data collection, comparison with the literature, and more. Our quest to expand the notion of a model to include not only its forward capabilities but also much broader situational awareness is, in our opinion, an important area of research that can benefit greatly from synthetic intelligence, or AI24, especially in applications to solving inverse materials design problems7,20,38,39. PRefLexOR offers one possible avenue to overcome these limitations through a multi-stage training and inference strategy, as visualized in Fig. 10. We demonstrated, as can be seen in Text Box 9, the unique capabilities of PRefLexOR, which uses a recursive reasoning framework, by qualitatively comparing its outputs with a baseline pretrained model.
[See PDF for image]
Fig. 10
Overview of the PRefLexOR algorithm, consisting of base model pre-training/incipient fine-tuning, structured thought integration training, independent reasoning development, and the recursive reasoning algorithm.
Each phase can be scaled independently with additional computing to improve performance.
While collecting preference data for multi-step reasoning poses substantial practical challenges, primarily due to the complexity and cost of obtaining reliable human judgments, our approach circumvents these limitations by autonomously generating synthetic preference data within the recursive reasoning process itself. As illustrated in Fig. 10, PRefLexOR leverages internal reasoning loops combined with automated evaluation mechanisms to create self-supervised preferences, thus eliminating dependence on manual annotation. This automated preference generation method allows the model to iteratively refine its reasoning strategies at scale, providing a robust solution to the problem of recursive reasoning optimization in contexts where human feedback is impractical or unavailable. If such feedback were available, it could easily be incorporated into the algorithm, adding further flexibility. Further research should be done to expand these experiments, specifically to explore applications to other scientific domains or broader problems like mathematics or logic. This would help address the limitation that the model developed here was focused specifically on biological materials science as a domain.
While conventional automated curriculum approaches leverage external knowledge bases and RAG, inherently providing robustness to errors through external knowledge retrieval, PRefLexOR directly updates the model’s internal parameters during recursive self-optimization. Although this direct parameter updating approach can potentially introduce risks of model degradation due to low-quality self-generated reasoning, we implemented several strategies to mitigate such risks: explicit preference-based filtering (ORPO and EXO methods), structured reflection to identify and correct reasoning errors, masking of intermediate reasoning steps to focus training on high-quality final answers, and continuous dynamic data generation to enhance reasoning diversity. We occasionally observed transient degradation early in training; however, these instances were short-lived, and the model consistently recovered and improved as recursive training progressed, developing consistent reasoning strategies to answer tasks. We speculate that careful design of the recursive optimization process effectively ensures stable model improvement, with further stabilization potentially achievable using larger-scale models. However, more research should be conducted to address some of these aspects, specifically to better understand the evolution of answering strategies during training.
The broader landscape of recursive reasoning models continues to evolve rapidly in recent literature. Hence, more extensive comparative analyses between emerging approaches will be essential for identifying complementary strengths and advancing the general capabilities of reflective, recursive models. Other research could focus on the development of preference datasets, which can be challenging in cases where little or no reasoning data is available. In our case, scientific papers provide a rich source for reasoning steps, but we acknowledge that this may be more challenging in other areas where such data is not available.
The key contributions of PRefLexOR are:
A new integration of preference optimization with recursive reasoning to allow models to engage in multi-step thought refinement.
A framework that uses thinking tokens to explicitly define and guide recursive reasoning within the model’s outputs.
The incorporation of ORPO and preference optimization (e.g. DPO/EXO) to align model reasoning with scientific methods and preferences through direct and fine-tuned optimization.
An active learning mechanism that enables real-time task generation, ensuring flexibility and adaptability to new reasoning challenges.
The application of recursive optimization that mirrors RL feedback loops, allowing the model to self-teach and iteratively improve its cognitive capacities.
Highly structured approaches to solve problems, especially relevant for science, can be used to endow the model with specific reasoning strategies relevant for particular domains.
Since our training was conducted with low-rank adaptation (LoRA) adapters, training can be done efficiently on local GPU hardware and easily extended to cover a wider range of adaptations (it can be utilized, for instance, in mixture-of-expert strategies such as X-LoRA12).
Looking to a few specific examples of results, one of the most compelling highlights is the model’s ability to draw meaningful connections between seemingly disparate fields, such as its analogy between Hermann Hesse’s Glass Bead Game36 and the hierarchical structure of proteins. This comparison underscores the model’s capacity for interdisciplinary reasoning, demonstrating how abstract philosophical concepts about interconnectedness and dynamic systems can be mapped onto concrete scientific phenomena, such as the layered complexity and functionality of biological systems. This synthesis of ideas illustrates the model’s potential to not only operate across diverse domains but also generate novel insights by bridging the gap between abstract thought and applied science. Another demonstration of interest was the transfer of the reasoning capability to new tasks, such as summarization and research proposal development.
We speculate that the invocation of the Glass Bead Game36 goes beyond its use as a test case to probe the model’s generalization capabilities; it also forms an analogy to what advanced reasoning models can do. In his novel, Hermann Hesse presents a game that synthesizes knowledge from various fields—such as mathematics, music, and philosophy—into a higher-order conceptual framework, with players combining ideas in ways that reveal deeper patterns and insights. Within the scope of PRefLexOR, this game becomes a metaphor for how the thinking and reflection processes in reasoning models operate. Just as players of the game engage in an iterative exploration of connections between disparate disciplines, LLMs with thinking and reflection phases mimic this recursive synthesis. The “thinking” and “reflection” phases, along with recursive agentic self-improvement, allow the model to explore multiple layers of reasoning and refinement for cohesive responses (see, e.g., Fig. 6), much like the Glass Bead Game connects concepts across domains. The structured interplay of thought and reflection in reasoning models echoes the intellectual depth and complexity of Hesse’s game, suggesting that, like the Glass Bead Game, such models may be capable of uncovering rich, interdisciplinary insights when guided by sophisticated reasoning strategies. This capacity to connect and reflect upon diverse ideas highlights the potential of LLMs to act as powerful tools for understanding, much like the characters in The Glass Bead Game use their symbolic play to explore the essence of knowledge itself, resembling the connections between bits of information shown in Fig. 1a.
Specifically, the Glass Bead Game36 is a symbolic system that serves as “a kind of synthesis of human learning.” The game represents a means of integrating and refining knowledge from diverse disciplines, such as mathematics, music, and philosophy. Players engage in an iterative process, continuously refining and revisiting concepts to discover deeper relationships between them. Similarly, the algorithm in this method employs a recursive approach where a fine-tuned Reasoning Model generates an initial response, which is then subjected to reflection and improvement through multiple iterations.
The process begins with the generation of an initial response, analogous to the first move in the Glass Bead Game, where the players begin with basic knowledge. As in the game, where players continually refine their moves through reflective thought, the algorithm extracts reflections from the initial response, enhancing the reasoning behind it. The Critic Model plays a role much like the intellectual rigor imposed by the rules of the Glass Bead Game, providing an evaluative framework that helps guide the refinement of responses. Through this iterative process, the model improves its output, cycling between generating new responses and reflecting on previous iterations until an optimal or integrated solution is reached.
This recursive thinking and reflection model mirrors the way the Glass Bead Game synthesizes diverse strands of knowledge into a cohesive whole. Just as the game is meant to model a kind of synthesis of human learning, the algorithm integrates reasoning and reflection to create responses that combine multiple iterations of thought into a more comprehensive final answer. In this way, the recursive algorithm not only produces more refined outputs but also illustrates how generative AI can emulate deep, interdisciplinary reasoning, much like the intellectual pursuit portrayed in Hesse’s game. Figure 11 depicts a possible flowchart of such an algorithm that merges ideas proposed in the PRefLexOR framework with the process introduced in the Glass Bead Game.
[See PDF for image]
Fig. 11
Flowchart of an expanded PRefLexOR algorithm forming a new approach that expands the original recursive reasoning algorithm depicted in Fig. 6.
The diagram of this proposed approach illustrates how the algorithm incorporates symbolic representation, interdisciplinary synthesis, and collaborative refinement to enhance its reasoning capabilities. This integration aligns the algorithm with the game’s emphasis on the unity of knowledge and deep contemplation across domains, knowledge fields, and modalities. A key shift compared to Fig. 6 is that the singularly focused Reasoning Model is replaced with a set of Collaborative Agents that have developed particular capabilities to examine logical steps towards solving a problem, and the Critic is replaced with an Interdisciplinary Knowledge Base Model that transcends the boundaries between fields. An additional feature of emphasis is the utilization of symbolic representation of knowledge, a feat that may resemble our incipient attempt to formulate certain categories of thinking (see Table 4) that yield highly structured thought processes.
In the integrated framework, a simple Reasoning Model is replaced with a set of Collaborative Agents, each acting as an individual reasoning engine with specialized expertise or perspectives (or a single model with distinct sets of special tokens to induce a particular type of reasoning specialty). This transformation allows the algorithm to simulate a community of thinkers, reflecting the collective intellectual exploration emphasized in the Glass Bead Game36. By incorporating multiple reasoning models as collaborative agents, the algorithm harnesses diverse viewpoints and methodologies, enhancing creativity, depth, and robustness in problem-solving. Each agent contributes unique insights, challenges others’ ideas, and collaboratively refines responses through iterative dialog, much like the scholars in the Glass Bead Game who engage in symbolic synthesis across disciplines.
Similarly, the Critic is replaced with an Interdisciplinary Knowledge Base Model, serving as a rich repository of information from various fields that all agents can access and utilize. This shift moves the focus from evaluation to synthesis, aligning the algorithm with the game’s emphasis on the unity of knowledge and deep contemplation by finding new connections40. The knowledge base enables agents to draw connections across different domains, fostering holistic understanding and allowing for more profound insights. By integrating this shared resource, the algorithm encourages collaborative synthesis rather than hierarchical critique, mirroring the Glass Bead Game’s practice of unifying arts and sciences through collective intellectual endeavor.
These revisions emulate collective intellectual endeavors at high levels of societal integration: simulating a community of thinkers enhances the algorithm’s ability to explore complex problems from multiple angles, and incorporating diverse expertise via agents with specialized knowledge contributes to a more comprehensive and nuanced understanding. This is believed to improve problem-solving, as collaborative refinement leads to innovative solutions and deeper insights. Replacing the Critic with the Interdisciplinary Knowledge Base Model24,40 improves the algorithm by facilitating knowledge synthesis, since providing agents with access to a broad spectrum of information promotes interdisciplinary connections. Emphasizing synthesis over critique fosters a holistic approach to reasoning by aligning with universal concepts of structure in knowledge representations, e.g., identified via isomorphic mappings. A shared knowledge base serves as common ground for agents to collaboratively build upon ideas, as was demonstrated in earlier work using graph reasoning40. This reconfiguration aligns the algorithm with the philosophical foundations of the Glass Bead Game, enhancing its capacity for profound, interconnected, and innovative reasoning. It transforms the algorithm into a more powerful system that mirrors the game’s emphasis on collaborative exploration, symbolic synthesis, and the unity of knowledge, ultimately leading to richer and more insightful responses.
We postulate that a feature of importance is the use of symbolic representation of knowledge. This may resemble our incipient attempt to formulate certain categories of thinking, as already conducted in the PRefLexOR algorithm implemented in this study; for instance, we refer to Table 4, which yields highly structured thought processes here tailored to the field of biological design. We anticipate that these categories could be structured to follow a more general logical progression of ideas. Alternatively, we may be able to develop concepts that utilize reasoning before decoding hidden states into tokens (as done in X-LoRA) to yield highly complex, abstract thought processes, or one may utilize a finite algebra to force a bottleneck that expresses reasoning in a narrow vocabulary of relationships.
As shown in Fig. 11, the integration of these components transforms the response generation process into a multidimensional, reflective system. First, symbolic representation abstracts initial responses into a universal form, facilitating manipulation across disciplines. This feeds into interdisciplinary synthesis, where knowledge from diverse fields enriches the response, promoting unity of understanding. Through contemplative reflection, deeper insights are uncovered, while collaborative refinement allows multiple agents to contribute diverse perspectives, enhancing the intellectual depth of the process. The system undergoes an evolutionary memory update, incorporating new insights for continuous learning. The entire process is driven by an iterative synthesis, looping through these stages until a refined and comprehensive understanding is achieved, mirroring the intellectual rigor of the Glass Bead Game. This revised PRefLexOR algorithm may thereby further enhance its capacity for profound, interconnected reasoning, aligning with the philosophical foundations of the Glass Bead Game. This results in responses that are richer, more innovative, and deeply reflective of a unified knowledge framework.
For further delineation of key analogies and processes, Table 2 provides a proposed comparison of how key concepts are integrated into the multi-agent reasoning framework. Each component represents a crucial enhancement to the algorithm, enabling it to generate more profound, interconnected, and innovative responses, and we provide a direct comparison with the existing algorithm. The task of symbolic representation seeks to convert the initial response generated by the model into symbolic form using a universal language or encoding system. This facilitates manipulation and combination of concepts across different domains. For example, if the LLM provides an initial explanation of a biological process, this step translates key concepts like “cell division” or “DNA replication” into symbols or diagrams, emphasizing the use of symbolic language to connect ideas. This mirrors the game’s core activity of manipulating symbols to reveal deep connections between disciplines, and could be accomplished by introducing a special token for this particular purpose, teaching models to achieve such abstraction. Next, interdisciplinary synthesis combines symbolic representations with knowledge from multiple disciplines to enrich the response. The LLM accesses a diverse knowledge base spanning various fields and uses thinking tokens to integrate insights from different domains, for instance, integrating mathematical models with philosophical theories to address complex problems like ethical considerations in AI. This mirrors the game’s synthesis of arts and sciences, promoting the unity of knowledge, and can be accomplished by introducing yet another special token for this particular purpose. During contemplative reflection, the process engages in deep, meditative reflection on the synthesized knowledge to uncover hidden insights. The LLM uses reflection tokens to introspect and perform deep analysis; for example, after synthesizing information, the LLM reflects on the ethical implications of its conclusions, considering long-term impacts. This captures the game’s meditative and introspective aspects, encouraging profound contemplation.
Table 2. This table summarizes how key concepts from the Glass Bead Game are integrated into a multi-agent reasoning framework
Component | Description | Integration with the modeling framework | Relation to glass bead game concepts |
|---|---|---|---|
Symbolic representation | Converts the initial response into symbolic form using a universal language or encoding. | Uses Thinking Tokens to guide the LLM in creating symbolic abstractions, facilitating abstraction and generalization of ideas, and enabling manipulation and combination of concepts. | Emphasizes the use of a symbolic language to connect ideas, reflecting the game’s core activity of manipulating symbols. |
Interdisciplinary synthesis | Combines symbolic representations with knowledge from multiple disciplines to enrich the response. | The LLM accesses an Interdisciplinary Knowledge Base, uses Thinking Tokens to integrate diverse insights, enhancing creativity and innovation. | Mirrors the game’s synthesis of arts and sciences, promoting the unity of knowledge. |
Contemplative reflection | Engages in deep, meditative reflection on the synthesized knowledge to uncover hidden insights. | Utilizes Reflection Tokens for introspection; the LLM performs deep analysis of implications and principles. | Captures the game’s meditative and introspective aspects, encouraging profound contemplation. |
Collaborative refinement | Multiple collaborative agents contribute diverse perspectives to refine the response. | Implements Multi-Agent Interaction among LLMs; agents use Thinking and Reflection Tokens to contribute and evaluate ideas, simulating a community of thinkers. | Emulates the game’s collective intellectual endeavor, enhancing responses through collaborative intelligence. |
Evolutionary memory update | Updates the system’s memory with new insights, enabling evolution over iterations. | The LLM stores successful patterns and connections; uses Thinking Tokens for learning and Reflection Tokens for evaluation, improving reasoning strategies. | Reflects the game’s evolutionary iteration, supporting continuous growth and learning. |
Explicit abstraction | Makes abstraction an intentional and directed process within the framework. | Guides the LLM to align with specific goals; enhances the quality and depth of reasoning; uses explicit prompts for abstraction. | Aligns with the game’s emphasis on symbolism and abstraction, facilitating the creation of harmonious connections. |
Bridging implicit and explicit abstraction | Connects the LLM’s inherent abstraction with explicit symbolic reasoning. | Combines the LLM’s strengths with guided processes; enhances explainability and control; leverages both implicit and explicit reasoning. | Enhances the game’s practice of connecting visible and underlying patterns, enriching the intellectual depth of the process. |
By incorporating components such as symbolic representation, interdisciplinary synthesis, and collaborative refinement, the algorithm enhances its capacity for profound, interconnected responses, aligning with the game’s emphasis on the unity of knowledge and deep contemplation.
Key concepts are highlighted in bold font in the table.
In collaborative refinement, multiple collaborative agents contribute diverse perspectives to refine the response. This implements multi-agent interaction among LLMs, where agents contribute ideas and evaluate each other’s inputs. For example, agents specializing in different fields collectively refine the response: one focusing on technical accuracy, another on ethical considerations, and a third on societal impact, physical soundness, experimental feasibility, and so on. This emulates the game’s collective intellectual endeavor, enhancing responses through collaborative intelligence. The evolutionary memory update step updates the system’s memory with new insights, enabling evolution over iterations. The model stores successful patterns and connections for later use (e.g., via category graph representations), using thinking and reflection tokens to guide learning and evaluate effectiveness. This reflects the game’s evolutionary iteration, supporting continuous growth and learning. During explicit abstraction, the model makes abstraction an intentional and directed process within the framework. Explicit prompts direct the model’s abstraction efforts toward specific goals, improving the depth and coherence of reasoning; for instance, instructing the model to focus on abstract principles underlying data rather than just summarizing it, perhaps using symbolic mechanisms or the same special tokens used during the initial symbolic representation. This aligns with the game’s emphasis on symbolism and abstraction, facilitating the creation of harmonious connections.
We can further seek to endow the model with capabilities to conduct implicit and explicit abstraction, connecting its inherent abstraction capabilities with explicit symbolic reasoning. This combines the model’s strengths with guided processes, enhancing transparency in the reasoning process. While the LLM may naturally abstract concepts, the framework ensures these abstractions are represented in alignment with the overall reasoning strategy. This enhances the game’s practice of connecting visible and underlying patterns, enriching the intellectual depth of the process.
As the algorithm proceeds, the flowchart itself may be optimized by planning agents, akin to what has been reported in other multi-agent systems, for instance, using concepts from graph reasoning24. This can include suggestions for deepening interdisciplinary connections by expanding the knowledge base, enhancing reflective depth through recursive self-improvement loops, optimizing collaboration with dynamic agent roles, and leveraging human-AI synergy for oversight and input.
Integrating these components may ultimately align the algorithm with the philosophical and methodological foundations of the Glass Bead Game and thereby enhance its capacity to generate responses rich in insight, creativity, and interconnected understanding, encouraging the algorithm to transcend traditional problem-solving approaches and embrace a holistic, integrative perspective.
Several avenues for future work offer important opportunities to enhance the capabilities of our model. Key directions include exploring agentic reasoning strategies, such as AutoGen41 and high degrees of agentic modeling via swarm-based approaches, and scaling to larger models for increased performance. Additionally, testing the model’s generalizability across diverse domains and incorporating multiple thinking sections with partial masking are promising methods for improving reasoning efficiency.
While PRefLexOR demonstrates promising results, certain limitations warrant further investigation. One such limitation is that the framework’s increased computational cost, especially in its recursive phases, may limit real-time applications, necessitating optimization strategies. In cases where compute is not a constraint, such as scientific discovery, this may not present a significant burden, but it may matter in other applications or when the model size is significantly larger. The current focus on a specific domain and on small models also suggests that future work should explore other areas of application, including broader, multi-disciplinary training.
We further note that in contrast to earlier work on LLMs using reflection-focused schemes29, 30–31, our approach—Preference-based Recursive Language Modeling for Exploratory Optimization of Reasoning (PRefLexOR)—introduces a fundamentally distinct training paradigm that integrates explicit recursive reasoning, preference-driven optimization via ORPO and EXO methods, dynamic in-situ data generation, and structured masking of intermediate reasoning steps. These innovations collectively aim to deepen reasoning capabilities, coherence, and robustness in generative models.
In terms of a comparison between Quiet-STaR33 and PRefLexOR, they differ fundamentally in how reasoning is elicited and optimized within language models. Quiet-STaR trains the model to generate internal rationales implicitly at each token position, optimizing them using a REINFORCE-based algorithm without explicit reference to human-like iterative reflection. In contrast, PRefLexOR explicitly incorporates recursive reasoning loops with structured reflection phases, using preference-based optimization methods (such as ORPO and EXO) to iteratively refine reasoning strategies. Additionally, Quiet-STaR optimizes general rationale generation across unstructured web-text, while PRefLexOR specifically structures recursive reasoning into distinct phases, leveraging symbolic tokens to guide explicit reflection and refinement in specialized reasoning contexts.
X-LoRA has been proposed as a flexible and computationally efficient method that leverages mixtures of low-rank adapter experts, each optimized for particular tasks or modalities12. As discussed in the introduction, the core concept is the use of dynamically activated adapters through silent tokens, allowing models to contextually determine the most relevant reasoning or knowledge integration pathways. This offers limited reasoning depth, and little control over how and to what extent reasoning steps are conducted.
However, one promising avenue where concepts from X-LoRA could be combined with the PRefLexOR framework would be through an approach where multiple PRefLexOR reasoning models, each adapted to distinct tasks or specialized reasoning strategies, are integrated via an X-LoRA-like gating and selection mechanism. Specifically, each of these individual PRefLexOR models would employ recursive preference-driven training to optimize reasoning and reflection within its specialized domain. X-LoRA’s adapter-based architecture could then dynamically select or blend these specialized reasoning engines on a token-by-token basis during inference, effectively creating a highly adaptive and versatile ensemble system.
Such integration would combine the depth and self-reflective capabilities inherent in the PRefLexOR recursive optimization framework with the computationally efficient, dynamic adaptability characteristic of X-LoRA. Practically, this could significantly enhance the model’s ability to handle complex, multi-domain reasoning tasks. For example, different adapters might specialize in scientific hypothesis generation, quantitative reasoning, design principles, or philosophical reflection, and the integrated X-LoRA layer would dynamically activate and combine these specialized capabilities, depending on the particular reasoning challenges posed by a given task. Ultimately, this combined strategy could yield a unified model that retains deep coherence and accuracy while benefiting from rapid adaptation and flexibility across a diverse range of scientific and exploratory tasks.
We further posit that future work should focus on refining reasoning strategies, including more structured outputs (e.g., additional steps to discover reasoning categories from data) and integrating other methods, potentially mixing various approaches for optimal outcomes; for instance, the use of graph-native reasoning, or the incorporation of symbolic reflection during the thinking phase. We believe that our initial work uncovered several directions and questions that future research should address. Another important direction for future research could be to investigate methods that trigger different reasoning strategies based on task type or allow the model to autonomously detect the best approach. For example, logic-based questions might follow a distinct reasoning pathway compared to materials design or regression tasks. The use of symbolic reasoning may further enhance generalization capabilities, perhaps combined with graph-theoretic concepts such as isomorphic analysis, as was suggested in other work40. Once developed, this adaptability may ultimately provide remarkable flexibility and precision in addressing diverse reasoning challenges.
More sophisticated agentic modeling can be another promising next step, where reasoning or reflection stages are critiqued or assessed for feasibility, particularly in areas such as physical design or materials science. By incorporating reflective critique, the model can continuously refine its reasoning processes. For example, reasoning steps could be critiqued based on real-world constraints, such as physical feasibility or design limitations, to ensure solutions are not only theoretically sound but practically viable.
Models can also benefit from improved reasoning feedback loops, where the reasoning steps are continuously refined based on the input obtained from the reflection phase, ultimately leading to higher-quality outputs. For instance, if an initial reasoning process lacks key considerations about material properties or environmental factors, the reflection process can identify these gaps, leading to a more complete and accurate solution in the final output. Naturally, the method can be expanded also to offer a variety of reasoning strategies during the initial Structured Thought Integration Training phase, so that a greater variety of thinking mechanisms can be utilized in the second phase. This iterative enhancement of reasoning will result in models that are not only more intelligent but also capable of producing outputs that are better aligned with complex, real-world challenges.
Materials and methods
Special tokens for reasoning
In this work, several special tokens were introduced to improve the structured reasoning and reflection capabilities of the model. These tokens are integrated into the tokenizer of the base Llama-3.2 model.
The following special tokens were added (Table 3):
Table 3. List of special tokens used during model training with the updated Llama-3.2 tokenizer
Token ID | Token |
|---|---|
128252 | |
128253 | |
128250 | |
128251 | |
128254 | |
128255 |
Only the
<|response|> and <|/response|> - Used to demarcate the boundaries of the final answer or response provided by the model.
<|reflect|> and <|/reflect|> - Used to mark the reflection phase, where the model evaluates and improves upon its initial reasoning.
<|thinking|> and <|/thinking|> - Used to denote the thinking phase, where the model generates its reasoning steps.
<|scratchpad|> and <|/scratchpad|> - Optionally used to provide a scratchpad for interim steps, allowing the model to store intermediary calculations or thoughts during inference.
These tokens allow for a clear delineation of different reasoning processes and phases within the model’s output, enabling it to engage in reflective and structured thinking. Below is a summary of these tokens and their properties, including their token IDs in the customized tokenizer. These tokens are instrumental in organizing and structuring the model’s reasoning and reflection capabilities, allowing for more precise control over the model’s inference and answer generation process. The tokenizer is available as part of the models developed in this work, or separately at
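As a minimal illustration of how such markers can be registered, the sketch below adds them to a tokenizer using the Hugging Face transformers API; the model identifier and the resulting token-ID assignment are assumptions for illustration and need not match the exact setup used in this work (cf. Table 3).

```python
from transformers import AutoTokenizer

# Hypothetical setup: register the reasoning markers as additional special tokens.
REASONING_TOKENS = [
    "<|thinking|>", "<|/thinking|>",
    "<|reflect|>", "<|/reflect|>",
    "<|response|>", "<|/response|>",
    "<|scratchpad|>", "<|/scratchpad|>",
]

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
tokenizer.add_special_tokens({"additional_special_tokens": REASONING_TOKENS})

# After loading the base model, its embedding matrix must cover the new vocabulary:
# model.resize_token_embeddings(len(tokenizer))
```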
On-the-fly dataset generation via in-situ knowledge extraction
The algorithm is designed to generate questions from a given context and provide both correct and incorrect answers. The process is conducted in-situ during training and consists of several key steps, which are described below (Fig. 12).
[See PDF for image]
Fig. 12
Overview of the algorithm used for question generation and answering, leading to prompt, chosen, and rejected responses.
Context enhancement with RAG during dataset generation
The context is enriched using RAG. This process involves querying the index with the generated question to retrieve additional relevant information and reasoning, which is then appended to the original context.
We build an index of text embeddings to facilitate efficient RAG. Each text chunk $T_i$ from the corpus of original raw data is transformed into a dense vector representation $v_i$ using the embedding model:

$$v_i = f_{\text{embed}}(T_i) \tag{1}$$

where $f_{\text{embed}}$ is the embedding function. When a query $Q$ is generated, it is similarly encoded into a vector $v_q$:

$$v_q = f_{\text{embed}}(Q) \tag{2}$$

Llama Index then computes the cosine similarity between $v_q$ and each $v_i$ in the index:

$$\mathrm{sim}(v_q, v_i) = \frac{v_q \cdot v_i}{\lVert v_q \rVert \, \lVert v_i \rVert} \tag{3}$$

The most relevant vectors are selected based on this similarity measure, retrieving the corresponding text chunks $T_j$, which are then appended to the original query context. This expanded context allows the LLM to generate a response that incorporates both the retrieved information and the pre-existing knowledge, improving the depth and relevance of the output. We use the
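To make Eqs. (1)–(3) concrete, a minimal retrieval sketch in plain NumPy is shown below; `embed` stands in for the embedding model $f_{\text{embed}}$, and the function name and top-k choice are illustrative assumptions rather than the implementation used in this work.

```python
import numpy as np

def retrieve_context(query, chunks, chunk_vectors, embed, top_k=3):
    """Embedding-based retrieval following Eqs. (1)-(3) (illustrative sketch).

    chunk_vectors holds the precomputed embeddings v_i of the text chunks T_i;
    embed(text) plays the role of f_embed.
    """
    v_q = embed(query)  # Eq. (2)
    # Eq. (3): cosine similarity between the query vector and every chunk vector
    sims = chunk_vectors @ v_q / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(v_q) + 1e-12)
    best = np.argsort(-sims)[:top_k]
    return [chunks[j] for j in best]  # chunks T_j to be appended to the context
```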
Raw data used for training
We use 500 scientific papers from the domain of biological and bio-inspired materials as the training data, as reported in earlier work14. To construct the raw corpus of text, we convert all PDFs into Markup language and then create text chunks. We use the LlamaIndex SentenceSplitter function with chunk size of 1024 tokens with chunk overlap of 20 tokens.
Context retrieval
The algorithm first retrieves relevant context information from a pre-constructed index of nodes. When a specific topic is provided, it selects nodes related to that topic; otherwise, it retrieves a random set of nodes (n = 3 in the work reported here). The text from the selected nodes is concatenated into a single context, which serves as the basis for question generation. The token length of the concatenated context is computed using a tokenizer.
Question generation
A domain-specific question is generated based on the provided context using a text generation model. The question is formulated to capture an important aspect of the context, without referring to specific studies, papers, or authors. The question is intended to be challenging, requiring expert-level knowledge to answer. The prompt used is:
Category-based information extraction
The algorithm extracts structured information from the context based on several predefined categories43. These categories include reasoning steps, relevant materials, and design principles, among others. For each category, the model generates a well-reasoned, concise explanation, which contributes to a deeper understanding of the question. The predefined categories are listed in Table 4.
Table 4. Categories used for extracting structured information from the context
Category | Description |
|---|---|
Reasoning steps | Logical steps that explain the reasoning behind the answer. |
Relevant materials or concepts | Key materials or scientific concepts related to the context. |
Design principles | Design-related considerations from the context. |
Material properties | Important properties of materials discussed in the context. |
Hypothesis | A proposed explanation based on the context. |
For each of the categories shown in Table 4, we use this prompting strategy:
This approach ensures a highly structured strategy to thinking through a particular problem space or domain. It can be modified, e.g., via the use of a set of special tokens and/or specially trained LoRA adapters44, to obtain specific thought processes that align with a particular aspect of reasoning. For instance, we can focus one reasoning process on design, another on manufacturing, another on biology, and so on. In the scope of symbolic reasoning, this can also be used to create a higher abstraction of the reasoning process.
In the work reported in this paper, we limit the scope to a single thinking section but with multiple categories embedded within to represent multiple streams of analysis. Future research could easily expand this initial approach to more complex approaches.
Thinking section for reasoning
The extracted information from each category is assembled into a “Thinking Section for Reasoning.” This section is designed to aid in the reasoning process by providing structured, logical insights. The Thinking Section includes key pieces of information from each category, which help guide the construction of the correct answer. It serves as a structured reasoning framework for answering the question.
Correct and incorrect answer generation
The correct answer is generated using the context and the Thinking Section. The reasoning included in the Thinking Section helps to formulate a well-structured and comprehensive response. Additionally, an incorrect (rejected) answer is generated either by a trained model or through a prompt-based approach. The rejected answer lacks logical reasoning and does not reference the correct context.
The correct answer is generated as follows:
In the first training stage, the rejected answer is generated by requesting the model to create an incorrect answer, as follows:
It is noted that, optionally, and always in the second stage of training, the currently trained model can be used to generate an answer from the question alone.
Final output
The algorithm outputs three elements:
The generated question with an instruction to include the Thinking Section for Reasoning.
The correct answer, which is enhanced with the structured Thinking Section for Reasoning.
The rejected answer, which is designed to be incorrect and devoid of proper reasoning (in ORPO phase) or an answer generated based on the current trained state of the model (in DPO/EXO phase).
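A compact sketch of one pass of this in-situ data generation loop is shown below; `index`, `question_model`, and `answer_model` are hypothetical stand-ins for the RAG index and the generation models, and the prompts are paraphrases rather than the exact prompts used in this work.

```python
def generate_preference_sample(index, question_model, answer_model, categories, topic=None):
    """One on-the-fly (prompt, chosen, rejected) triple, following the steps above (sketch)."""
    # 1) Context retrieval: topic-specific or random nodes (n = 3 in this work)
    nodes = index.get_nodes(topic=topic, n=3)
    context = "\n".join(node.text for node in nodes)

    # 2) Question generation from the context
    question = question_model(
        "Formulate a challenging, expert-level question about this context:\n" + context)

    # 3) Category-based extraction assembled into a thinking section (cf. Table 4)
    thinking = "\n".join(
        f"{cat}: " + answer_model(f"Context:\n{context}\nExtract {cat} relevant to: {question}")
        for cat in categories)

    # 4) Correct answer uses context + thinking; rejected answer deliberately lacks both
    chosen = answer_model(
        f"{context}\n<|thinking|>{thinking}<|/thinking|>\nAnswer the question: {question}")
    rejected = answer_model(f"Give an incorrect, poorly reasoned answer to: {question}")

    prompt = question + "\nInclude a thinking section in your response."
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```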
Reflection section
When we use an additional reflection section, we introduce an introspective step in the algorithm that critiques the reasoning process used to generate an answer. This function asks the model to evaluate the thinking behind the generated answer and suggest improvements. The reflection process is guided by the following prompt:
Models used for dataset generation
We use the
Handling reasoning tokens in preference alignment loss computation
In our algorithm, we revise conventional preference optimization frameworks to handle the learning of intermediate reasoning steps, referred to as “thinking tokens.” These tokens represent the model’s internal reasoning processes and are enclosed by the special tokens <|thinking|> and <|/thinking|>.
Masking of thinking tokens in multiple sections
In this approach, all tokens between the <|thinking|> and <|/thinking|> markers are excluded from the loss computation, while the marker tokens themselves are retained.
For a sequence of token IDs $t = [t_1, t_2, \ldots, t_n]$ and log-probabilities $p = [p_1, p_2, \ldots, p_n]$, a boolean mask $m = [m_1, m_2, \ldots, m_n]$ is applied, where:

$$m_i = \begin{cases} 0 & \text{if } t_i \text{ is an inner thinking token,} \\ 1 & \text{otherwise.} \end{cases} \tag{4}$$

The masked log-probabilities $p_{\text{masked}}$ are computed as:

$$p_{\text{masked}} = p \odot m \tag{5}$$

where $\odot$ denotes element-wise multiplication. This approach ensures that the inner thinking tokens are ignored during loss computation, while the model is still encouraged to generate the correct reasoning markers. The loss is then calculated as:

$$\mathcal{L} = -\log \sigma\!\left(\beta\left[\sum_i p^{\text{chosen}}_{\text{masked},i} - \sum_i p^{\text{rejected}}_{\text{masked},i}\right]\right) \tag{6}$$

where $\beta$ is a temperature parameter, $p^{\text{chosen}}_{\text{masked}}$ are the masked log-probabilities for the chosen response, and $p^{\text{rejected}}_{\text{masked}}$ are those for the rejected response. Additionally, we introduce flexibility by allowing a fraction of the inner thinking tokens to be masked, controlled by a parameter $\alpha$. For a sequence of $n$ thinking tokens, $\lfloor \alpha \cdot n \rfloor$ tokens are randomly selected for masking, where $0 \le \alpha \le 1$. Setting $\alpha = 0$ results in no masking, while $\alpha = 1$ masks all inner thinking tokens. Figure 13 shows a flowchart that explains the process when all thinking tokens are masked, either in multiple thinking sections or for all thinking tokens before the answer is developed.
[See PDF for image]
Fig. 13
Flowchart showing how thinking tokens are masked, and dynamic answer comparison is applied.
If thinking tokens are masked, the start and end are identified and masked accordingly. If dynamic answer comparison is enabled, masking occurs before the final answer. All experiments done as part of the study reported in this paper used only one thinking section, which is equivalent to the “Dynamic Answer Comparison” approach, where we mask all content before the final answer. All tokens within the thinking start/end tokens are masked, whereas the thinking start/end tokens are provided to the model to trigger it to use that particular process.
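The following PyTorch sketch illustrates the masking step described above; the function name and the way the special-token IDs are passed in are illustrative assumptions, not the exact implementation.

```python
import torch

def mask_thinking_logps(token_ids, logps, think_start_id, think_end_id, alpha=1.0):
    """Zero out log-probabilities of tokens inside <|thinking|>...<|/thinking|> spans.

    The start/end markers themselves stay unmasked so the model still learns to
    emit them; a fraction alpha of the inner tokens is masked (alpha = 1 masks all).
    """
    mask = torch.ones_like(logps)
    inner_positions, inside = [], False
    for i, tok in enumerate(token_ids.tolist()):
        if tok == think_start_id:
            inside = True
        elif tok == think_end_id:
            inside = False
        elif inside:
            inner_positions.append(i)

    n_mask = int(alpha * len(inner_positions))  # floor(alpha * n) inner tokens to mask
    if n_mask > 0:
        chosen = torch.randperm(len(inner_positions))[:n_mask]
        for j in chosen.tolist():
            mask[inner_positions[j]] = 0.0
    return logps * mask  # element-wise product, Eq. (5)
```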
Dynamic final answer detection via masked thinking tokens
To provide the model with more flexibility in producing variable-length thinking periods, we introduce a dynamic final answer detection approach. Instead of masking tokens, this approach dynamically identifies the final answer by detecting the last occurrence of the end-of-thinking marker <|/thinking|>, at position:

$$k = \max\{\, i : t_i = \texttt{<|/thinking|>} \,\} \tag{7}$$

Log-probabilities for the final answer are computed as:

$$p_{\text{final}} = [p_{k+1}, p_{k+2}, \ldots, p_n] \tag{8}$$

The DPO loss is then calculated based only on the log-probabilities for the final answers, as:

$$\mathcal{L}_{\text{DPO}} = -\log \sigma\!\left(\beta\left[\sum_i p^{\text{chosen}}_{\text{final},i} - \sum_i p^{\text{rejected}}_{\text{final},i}\right]\right) \tag{9}$$
This approach allows the model to produce reasoning steps of variable lengths while ensuring that the comparison focuses solely on the final answer. As in the other method, the last end-thinking token is optionally excluded from the comparison.

Comparison of masking approaches
The two approaches—masking of thinking tokens and dynamic final answer detection—provide complementary mechanisms for handling reasoning steps during training. Masking focuses on excluding inner thinking tokens while ensuring the model learns to produce the start and end reasoning markers, whereas dynamic detection provides more flexibility by ignoring the intermediate reasoning tokens altogether and focusing on the final output. These methods are toggled via the
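A minimal sketch of the dynamic final-answer detection mode is given below (assuming PyTorch tensors; the function name is hypothetical):

```python
import torch

def final_answer_logps(token_ids, logps, think_end_id, exclude_end_token=True):
    """Keep only the log-probabilities after the last end-of-thinking marker.

    This lets thinking sections vary in length while the preference comparison
    is restricted to the final answer (Eqs. (7)-(9)).
    """
    positions = (token_ids == think_end_id).nonzero(as_tuple=True)[0]
    if positions.numel() == 0:
        return logps  # no thinking section found; use the full sequence
    start = positions[-1].item() + (1 if exclude_end_token else 0)
    return logps[start:]
```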
Optimizing final answers with masked reasoning using mode-seeking preference alignment
We employed the EXO method27 in the special case where K = 2, optimizing the model to align with the ground truth answers while masking intermediate reasoning tokens. While masking these tokens, we observed significant improvements in performance when using EXO compared to DPO. We additionally applied label smoothing with a smoothing factor of 5 × 10⁻³ and set the scaling parameter β to 0.1. We use a learning rate of 5 × 10⁻⁷ and a maximum gradient norm of 0.3. We run one epoch of training each time a new on-the-fly dataset is generated, each featuring a set of 50 samples.
The EXO method minimizes the reverse Kullback-Leibler (KL) divergence, which promotes a mode-seeking behavior in the model. Mode-seeking ensures that the model focuses on generating the most likely and preferred answers based on the training data, prioritizing reasoning paths that consistently lead to correct outcomes. This is in contrast to DPO’s mean-seeking behavior, which balances multiple potential reasoning paths, often resulting in more generalized and less effective answers.
In the special case of K = 2, the EXO loss is expressed as:

$$\mathcal{L}_{\text{EXO}} = \mathrm{D}_{\mathrm{KL}}\!\left(p_\theta \,\|\, p_r\right) = \sum_{y \in \{y_w,\, y_l\}} p_\theta(y)\, \log \frac{p_\theta(y)}{p_r(y)} \tag{10}$$

where $p_\theta$ and $p_r$ represent the empirical distributions from the learned policy and the preference model, respectively, over the two completions: $y_w$ (the ground truth answer) and $y_l$ (the rejected answer predicted by the current model being trained). The scaling parameter $\beta$ (set to 0.1) controls the strength of the preference alignment. We mask the reasoning tokens during training, which prevents the model from directly observing how to reason through problems. However, the model still learns to infer the likely reasoning pathways based on the surrounding context and the final answer. EXO’s mode-seeking behavior plays a crucial role here, as it ensures that the model fills in the masked thinking tokens by inferring the most effective reasoning patterns that align with successful final answers. This is crucial because even though the thinking tokens are hidden, EXO thereby encourages the model to predict the reasoning that most often leads to the correct solution.
By focusing on the final answer during training, EXO effectively optimizes the entire reasoning process by indirectly learning which reasoning paths are most effective. It does so by leveraging the preference between the ground truth answer (yw) and the rejected answer (yl). The model learns to generate answers that are more likely to match the preferred reasoning processes by inferring and prioritizing the key modes in the training data, even when reasoning is masked. The use of label smoothing further regularizes the model by preventing it from being overly confident in its predictions, encouraging smoother distributional outputs.
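For illustration, a minimal PyTorch sketch of the K = 2 mode-seeking loss is shown below; the exact scaling convention for β, the construction of the reference distribution, and the placement of label smoothing are assumptions based on the description above rather than the library implementation.

```python
import torch
import torch.nn.functional as F

def exo_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, smoothing=5e-3):
    """Reverse-KL preference loss over a (chosen, rejected) pair (sketch).

    logp_* are summed (masked) log-probabilities of the ground-truth answer y_w and
    the rejected answer y_l under the policy; ref_logp_* are the same quantities
    under the reference model.
    """
    # Policy distribution p_theta over {y_w, y_l}, defined via beta-scaled log ratios
    logits = torch.stack([logp_w - ref_logp_w, logp_l - ref_logp_l], dim=-1) / beta
    log_p_theta = F.log_softmax(logits, dim=-1)

    # Preference distribution p_r: nearly all mass on y_w, softened by label smoothing
    p_r = torch.tensor([1.0 - smoothing, smoothing], device=logits.device)

    # Reverse KL divergence KL(p_theta || p_r): mode-seeking behavior
    p_theta = log_p_theta.exp()
    return (p_theta * (log_p_theta - p_r.log())).sum(dim=-1).mean()
```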
EXO versus DPO: impact on final responses
In comparison to DPO, which minimizes the forward KL divergence and thus adopts a mean-seeking approach, EXO ensures that the model focuses on the most dominant reasoning patterns that lead to correct answers. DPO’s mean-seeking behavior leads the model to spread probability mass across multiple reasoning paths, which can dilute its performance. In contrast, EXO concentrates on the most likely and correct answers, ignoring less effective or outlier reasoning paths, thus leading to more precise and higher-quality answers.
This is especially relevant in our case, where the final answer is prioritized in training. While the thinking tokens are masked, the final answer provides the key signal for optimizing the model’s reasoning capabilities. EXO helps the model learn to predict the most likely and effective reasoning paths that lead to that final answer, ensuring strong alignment with ground truth reasoning patterns, even without direct observation.
The use of EXO in our training, combined with label smoothing and the chosen β value, led to superior alignment with the ground truth, particularly in scenarios where final answer accuracy was prioritized over intermediate reasoning visibility, with overall better long-term training performance and less overfitting (Fig. 4).
Iterative response improvement using thinking and reflection model
As an experiment to test iterative improvement at test time, we implemented a recursive response algorithm that leverages the reasoning model to generate, improve, and integrate responses to a given question. This approach aims to produce more refined, comprehensive, and well-reasoned answers through iterative improvement and integration of multiple perspectives, using two LLM agents (see Fig. 6 for a detailed flowchart).
The algorithm operates using two distinct LLM agents:
Reasoning Model (Agent 1): This is a fine-tuned model specifically designed for advanced reasoning capabilities. It generates responses that include both a thinking process and a reflection on that process. The thinking process represents the model’s step-by-step reasoning approach to answering the question, while the reflection component critically evaluates this thinking process.
Critic Model (Agent 2): This is a non-fine-tuned, general-purpose language model. Its role is twofold:
To improve the thinking process based on the reflections provided by the reasoning model.
To integrate all generated responses into a final, comprehensive answer.
The reasoning model is the trained model, and the critic model was delineated as
The algorithm proceeds as follows:
The reasoning model generates an initial response to the question, including both a thinking process and a reflection on that process.
For a specified number of iterations (N):
The thinking process and reflection are extracted from the most recent response.
The critic model analyzes the reflection and generates an improved version of the thinking process.
The reasoning model then generates a new response using this improved thinking process as a guide.
A new reflection is extracted from this response for use in the next iteration.
After all iterations are complete, the critic model performs a final integration step. It combines all generated responses into a comprehensive answer, leveraging the diverse perspectives and improvements made throughout the iterative process.
This iterative approach, leveraging the strengths of both a specialized reasoning model and a general-purpose critic model, allows for continuous refinement of the reasoning process. The fine-tuned reasoning model provides structured, thoughtful responses, while the critic model offers a fresh perspective for improvement and integration. This combination potentially leads to more nuanced, well-considered, and comprehensive final responses.
The use of thinking and reflection tokens (
By separating the roles of response generation (reasoning model) and response improvement/integration (critic model), the algorithm benefits from both specialized reasoning capabilities and general language understanding. This separation also allows for potential improvements or replacements of either model independently, enhancing the flexibility and scalability of the approach.
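A condensed sketch of this two-agent loop is shown below; `reasoning_model` and `critic_model` stand in for whatever generation interface is used, the prompts are paraphrases, and `extract_between` is a hypothetical helper keyed to the special tokens defined in the Methods.

```python
def recursive_improvement(question, reasoning_model, critic_model, n_iter=2):
    """Iterative response refinement with a Reasoning agent and a Critic agent (sketch)."""
    responses = [reasoning_model(question)]  # initial response with thinking + reflection
    for _ in range(n_iter):
        last = responses[-1]
        thinking = extract_between(last, "<|thinking|>", "<|/thinking|>")
        reflection = extract_between(last, "<|reflect|>", "<|/reflect|>")
        improved = critic_model(
            f"Improve this thinking process using the reflection.\n"
            f"Thinking:\n{thinking}\n\nReflection:\n{reflection}")
        responses.append(reasoning_model(
            f"{question}\nUse this improved thinking process as a guide:\n{improved}"))
    # Optional final integration of all drafts into one comprehensive answer
    return critic_model("Integrate these responses into a single comprehensive answer:\n"
                        + "\n---\n".join(responses))

def extract_between(text, start_tag, end_tag):
    """Return the substring between the first start_tag and the following end_tag."""
    s = text.find(start_tag)
    e = text.find(end_tag, s + len(start_tag)) if s != -1 else -1
    return text[s + len(start_tag):e] if s != -1 and e != -1 else ""
```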
Analysis of iterative result quality
We use
The categories identified by
Coherency: measures the logical flow and structure of the response.
Accuracy: how well the response aligns with established scientific facts regarding biological materials and failure mechanisms.
Depth of explanation: the level of detail and technical understanding conveyed in the response.
Clarity: how easily understandable the explanation is, especially in communicating complex scientific ideas.
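The scoring itself can be automated with a judge model; a minimal sketch is shown below, in which the 0–10 scale, the prompt wording, and the `judge_model` interface are illustrative assumptions.

```python
CRITERIA = ["Coherency", "Accuracy", "Depth of explanation", "Clarity"]

def score_response(judge_model, question, response):
    """Ask a judge LLM to rate a response on each criterion and report the average (sketch)."""
    scores = {}
    for criterion in CRITERIA:
        reply = judge_model(
            f"Rate the following answer to '{question}' for {criterion} on a scale of 0-10. "
            f"Reply with a single number only.\n\n{response}")
        scores[criterion] = float(reply.strip())
    scores["Average"] = sum(scores[c] for c in CRITERIA) / len(CRITERIA)
    return scores
```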
Training stages
The first training stage, using ORPO, is conducted using a Low-Rank Adaptation (LoRA)44 mechanism with a rank of r = 64 and α = 64. This design is applied to the following layers of the transformer architecture:
In the second training stage, we create a new adapter (built on top of the resulting model obtained after the first stage). We use r = 64 and α = 64, applied to the following layers of the transformer architecture:
The second phase focuses less on the intermediate reasoning steps and more on producing accurate final answers. Instead of explicitly guiding the model through the reasoning process, this stage lets the model figure out the best reasoning paths on its own, using the knowledge and structures learned during the first phase. The EXO-based preference alignment optimizes the final answer, allowing the model to concentrate on mode-seeking behavior, where it prioritizes the most likely and correct answer based on learned reasoning strategies. By focusing on the outcome, this phase ensures the model produces high-quality answers while letting it autonomously handle reasoning complexity.
Thus, the combination of ORPO in the first stage for learning basic reasoning, and preference optimization via EXO in the second stage for refining answer accuracy, results in a robust, multi-phase training process that improves both the model’s reasoning abilities and its final outputs.
The training algorithms are implemented on top of the Hugging Face Transformer Reinforcement Learning (TRL) library (https://huggingface.co/docs/trl/en/index), specifically using classes inherited from the ORPO and DPO trainers.
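A condensed sketch of how the two stages map onto these trainers is given below; argument names follow recent TRL versions, the model, tokenizer, and in-situ generated datasets are assumed to be defined elsewhere, the beta values are placeholders, and the EXO-specific loss modification (realized by subclassing the DPO trainer) is omitted.

```python
from trl import ORPOConfig, ORPOTrainer, DPOConfig, DPOTrainer

# Stage 1: ORPO on preference pairs that include the thinking sections,
# teaching the structured reasoning format (model carries the stage-1 adapter).
orpo_trainer = ORPOTrainer(
    model=model,
    args=ORPOConfig(output_dir="stage1_orpo", beta=0.1),   # beta value assumed
    train_dataset=stage1_dataset,
    processing_class=tokenizer,
)
orpo_trainer.train()

# Stage 2: preference optimization focused on the final answer (EXO in the
# paper is implemented via the DPO trainer); a fresh adapter sits on the
# stage-1 model.
dpo_trainer = DPOTrainer(
    model=stage1_model,
    args=DPOConfig(output_dir="stage2_exo", beta=0.1),     # beta value assumed
    train_dataset=stage2_dataset,
    processing_class=tokenizer,
)
dpo_trainer.train()
```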
Brief recap of key concepts in Hesse’s novel The Glass Bead Game
Hermann Hesse’s “The Glass Bead Game”36,45 is a novel that envisions a society where intellectual pursuit and the synthesis of knowledge are paramount. Central to the story is the province of Castalia, a secluded community dedicated to scholarly excellence and the mastery of the Glass Bead Game. The Game itself is an abstract, highly intellectual practice that symbolically combines elements from various disciplines—such as music, mathematics, literature, and philosophy—into a cohesive and harmonious whole.
In the game, players use sophisticated symbolic language to explore and reveal the underlying connections between disparate fields of knowledge, embodying ideals of interdisciplinary synthesis and deep contemplation. The novel reflects philosophical traditions that emphasize the unity of knowledge and the interconnectedness of all things, resonating with concepts from Neoplatonism, which posits a single underlying reality, and Eastern philosophies like Taoism and Zen Buddhism, which highlight the harmony and balance inherent in the universe. Hesse’s work invites reflection on the role of intellectualism in society, the pursuit of enlightenment, and the balance between contemplation and engagement with the material world.
While the precise mechanics of the Glass Bead Game are intentionally left abstract by Hesse, within the novel, it is portrayed as a highly formalized and complex practice. The Game involves the creation and manipulation of symbols that represent ideas from various disciplines, structured in a way that reveals their underlying relationships and harmonies. Players, who have devoted years to study and preparation, engage in elaborate sessions where they compose and interpret these symbolic sequences, often accompanied by meditative reflection. The highest authority in the Game is the Magister Ludi (Latin for “Master of the Game”), who oversees the advancement and integrity of the Game within Castalia. The Magister Ludi is responsible for guiding the community of players, fostering the development of the Game’s symbolic language, and ensuring that the practice remains aligned with the ideals of wisdom, intellectual purity, and the unification of knowledge that the Game embodies.
By paralleling the vision in the Glass Bead Game, where the synthesis of all knowledge transforms human understanding, the rise of AI and complex reasoning engines is dramatically expanding the scope of the human intellect, shifting our role from mere information processors to interpreters, ethical stewards, and collaborators with intelligent systems. This evolution challenges us to redefine our intellectual boundaries and embrace the profound impact of AI on human thought, creativity, and the collective pursuit of wisdom.
Relation between preference optimization and RL strategies
In the DPO setup, we typically optimize based on preference comparisons of chosen versus rejected outputs. However, by generating training data randomly at every step and dynamically constructing reference reasoning, the model’s training environment begins to incorporate elements of meta-learning. This randomization introduces exploration, adaptability, and generalization, making the process more akin to meta-learning, where the model is trained not just to solve a specific task, but to learn how to solve tasks in general.
Meta-learning (“learning to learn”) aims to develop models that can quickly adapt to new tasks by leveraging prior knowledge. In this context, by generating random training data and reference reasoning at each training step, we are effectively training the model on a distribution of tasks. Each task (random reasoning structure) represents a variation in reasoning, and the model learns not just how to answer specific reasoning problems, but how to reason in general.
One of the key principles of meta-learning is task distribution. Rather than a fixed, static dataset, the model encounters a wide variety of reasoning challenges, each randomly generated. This encourages the model to develop a generalized reasoning framework that can adapt quickly to new tasks, mirroring the concept of few-shot learning in meta-learning.
Another significant aspect is the exploration of reasoning paths. Since the reference reasoning is constructed dynamically, the model must learn to explore different reasoning strategies each time. This exploration forces the model to generalize better because it cannot rely on memorizing patterns from repeated examples. This is similar to RL, where the agent explores different actions and learns from their consequences. In this DPO setup, however, the exploration happens in the space of reasoning paths rather than state-action sequences, and the reward signal takes the form of preference alignment rather than cumulative rewards.
By incorporating partial masking in the reasoning process, the model is trained to handle incomplete information. This is similar to RL agents making decisions with partial observability. The masking challenges the model to optimize for the final answer despite missing parts of the reasoning. It also encourages the model to focus more on the outcome (the preferred answer) rather than the exact steps taken to reach that answer. This aligns well with the preference optimization framework, which optimizes based on output preference rather than the reasoning steps themselves.
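A toy example of such partial masking is sketched below; the marker tokens, the sentence-level granularity, and the keep probability are assumptions for illustration.

```python
import random

def mask_thinking(text, start="<|thinking|>", end="<|/thinking|>",
                  keep_prob=0.5, seed=None):
    """Randomly drop sentences inside the thinking section so the model must
    optimize the preferred answer despite incomplete reasoning."""
    rng = random.Random(seed)
    i, j = text.find(start), text.find(end)
    if i == -1 or j == -1 or j < i:
        return text  # no thinking section found; leave text unchanged
    body = text[i + len(start):j]
    sentences = [s.strip() for s in body.split(". ") if s.strip()]
    kept = [s for s in sentences if rng.random() < keep_prob]
    return text[:i + len(start)] + ". ".join(kept) + text[j:]
```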
Moreover, the randomization of data and reasoning structures encourages the model to develop a meta-strategy for reasoning. Instead of learning a fixed method for solving a specific task, the model learns to adapt to new reasoning patterns dynamically. This meta-strategy helps the model generalize across a variety of reasoning tasks, much like how a meta-learning model generalizes across different tasks by learning higher-level patterns and strategies.
By generating data and reference reasoning on-the-fly, we are also introducing a form of exploration that is typically seen in RL. In RL, exploration is necessary for the agent to discover optimal policies, while in this DPO setup, the exploration arises from the random selection of reasoning data and paths, requiring the model to adapt its reasoning strategy each time. The difference, however, is that DPO focuses on preference optimization rather than cumulative rewards over time.
Furthermore, this dynamic and random training process allows for fast adaptation, a key element in meta-learning. The model is forced to adapt quickly to new tasks (reasoning paths) and make decisions based on incomplete or masked information. This is akin to the few-shot learning scenario in meta-learning, where models must learn to perform tasks with minimal data. In this case, the model learns to reason effectively even when part of the reasoning process is hidden.
Finally, the overall structure of this training setup, with dynamic data generation and reasoning masking, creates a meta-learning-like environment where the model develops the ability to handle random and diverse reasoning tasks without overfitting to any specific pattern. The model is learning not just to perform well on a single task, but to adapt quickly and generalize across a wide range of reasoning challenges, which is the core goal of meta-learning. Our experiments have confirmed that this was indeed the case.
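At the level of pseudocode, the per-step, in-situ generation of preference data described above can be summarized roughly as follows; question_fn, reasoning_fn, and answer_fn are hypothetical stand-ins for the actual generation calls, and the choice of the current model's unguided response as the rejected sample is an assumption for illustration.

```python
import random

def sample_training_pair(corpus, question_fn, reasoning_fn, answer_fn, rng=random):
    """Generate one preference pair on the fly: a random source passage yields a
    question, a dynamically constructed reference ('chosen') reasoning + answer,
    and a 'rejected' response."""
    passage = rng.choice(corpus)
    question = question_fn(passage)           # e.g., question distilled from the passage
    chosen = reasoning_fn(passage, question)  # reference thinking + final answer
    rejected = answer_fn(question)            # e.g., current model's unguided response
    return {"prompt": question, "chosen": chosen, "rejected": rejected}
```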
Modification to further align with RL
To modify this method further and make it more similar to RL, the process of generating random training data could be formalized as a sequential decision-making problem, where each step in the reasoning process is treated as an action, and the sequence of reasoning steps forms a trajectory. In this scenario, new random training data would be constructed as a result of decisions made by the model in a feedback loop, where each reasoning step influences the next.
For instance, rather than randomly selecting data for each training step, the model could be given the ability to actively select which reasoning path or data point to explore based on a learned policy. This would involve assigning a reward signal based on how well the reasoning aligns with human preferences at the end of the trajectory. The model would be rewarded not only for generating the correct final output but also for choosing intermediate reasoning steps that lead to better outcomes. This reward could be delayed, encouraging the model to think strategically about the long-term effects of its reasoning choices, much like in RL, where the agent learns to maximize cumulative rewards over time.
Moreover, the random generation of data could be made contingent on past reasoning steps, transforming the reasoning process into an interactive environment where the model’s actions (reasoning decisions) impact the data it encounters next. In this way, the reasoning process becomes more dynamic, with the model continually adjusting its path based on feedback from previous steps, which aligns closely with the exploration and exploitation trade-off seen in RL. These examples illustrate the rich set of future research directions that can build on the concepts proposed in this paper.
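Purely as a sketch of this sequential-decision view, the episode structure could be expressed as follows; the policy, environment-step, and reward functions are hypothetical placeholders rather than part of the present method.

```python
def reasoning_episode(question, policy_fn, step_fn, reward_fn, max_steps=5):
    """Treat each reasoning step as an action; the preference-based reward is
    assigned only once the trajectory ends (a delayed reward, as in RL)."""
    state, trajectory = question, []
    for _ in range(max_steps):
        action = policy_fn(state)          # choose the next reasoning step / data point
        state = step_fn(state, action)     # new context produced by that choice
        trajectory.append((action, state))
    return trajectory, reward_fn(question, state)  # preference-aligned reward at the end
```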
Acknowledgements
This work was supported in part by Google, the MIT Generative AI Initiative, with additional support from NIH.
Author contributions
M.J.B. designed the overall research, developed algorithms and codes, and conducted the training, experimental assessments, and data analysis. He developed and executed the distillation strategies to generate structured and adversarial data. He wrote and edited the paper.
Data availability
No datasets were generated or analyzed during the current study.
Code availability
Codes, including training scripts, are available at https://github.com/lamm-mit/PRefLexOR.
Competing interests
The author declares no competing interests.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Vaswani, A. et al. Attention is all you need. https://papers.nips.cc/paper/7181-attention-is-all-you-need (2017).
2. Radford, A., Narasimhan, K., Salimans, T. & Sutskever, I. Improving language understanding by generative pre-training. https://gluebenchmark.com/leaderboard
3. Xue, L. et al. ByT5: Towards a token-free future with pre-trained byte-to-byte models. Trans. Assoc. Comput. Linguist. 10, 291–306 (2021). https://dx.doi.org/10.1162/tacl_a_00461
4. Jiang, A. Q. et al. Mistral 7B http://arxiv.org/abs/2310.06825 (2023).
5. Phi-2: The surprising power of small language models - Microsoft Research. https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/
6. Dubey, A. et al. The Llama 3 herd of models https://arxiv.org/abs/2407.21783 (2024).
7. Buehler, M. J. MeLM, a generative pretrained language modeling framework that solves forward and inverse mechanics problems. J. Mech. Phys. Solids 105454 https://linkinghub.elsevier.com/retrieve/pii/S0022509623002582 (2023).
8. Singhal, K. et al. Large language models encode clinical knowledge. Nature https://www.nature.com/articles/s41586-023-06048-6 (2023).
9. Qu, J. et al. Leveraging language representation for materials exploration and discovery. npj Comput. Mater. 10, 58 (2024). https://dx.doi.org/10.1038/s41524-024-01231-8
10. Yu, S., Ran, N. & Liu, J. Large-language models: the game-changers for materials science research. AI in Chemical Engineering 100076 https://doi.org/10.1016/j.aichem.2024.100076 (2024).
11. Hu, Y. & Buehler, M. J. Deep language models for interpretative and predictive materials science. APL Mach. Learn. 1, 010901 (2023). https://dx.doi.org/10.1063/5.0134317
12. Buehler, E. L. & Buehler, M. J. X-LoRA: mixture of low-rank adapter experts, a flexible framework for large language models with applications in protein mechanics and design https://arxiv.org/abs/2402.07148v1 (2024).
13. Buehler, M. J. MechGPT, a language-based strategy for mechanics and materials modeling that connects knowledge across scales, disciplines and modalities. Appl. Mech. Rev. https://doi.org/10.1115/1.4063843 (2023).
14. Luu, R. K. & Buehler, M. J. BioinspiredLLM: conversational large language model for the mechanics of biological and bio-inspired materials. Adv. Sci. https://doi.org/10.1002/advs.202306724 (2023).
15. Lu, W., Luu, R. K. & Buehler, M. J. Fine-tuning large language models for domain adaptation: exploration of training strategies, scaling, model merging and synergistic capabilities. npj Comput. Mater. 11, 84 (2025).
16. Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models https://arxiv.org/abs/2201.11903 (2023).
17. Kojima, T., Gu, S. S., Reid, M., Matsuo, Y. & Iwasawa, Y. Large language models are zero-shot reasoners https://arxiv.org/abs/2205.11916 (2023).
18. Cranford, S. & Buehler, M. Materiomics: biological protein materials, from nano to macro. 3, 127–148 (2010).
19. Cranford, S. W. & Buehler, M. J. Biomateriomics (Springer Netherlands, 2012). https://link.springer.com/book/10.1007/978-94-007-1611-7
20. Buehler, M. J. A computational building block approach towards multiscale architected materials analysis and design with application to hierarchical metal metamaterials. Model. Simul. Mater. Sci. Eng. 31, 054001 (2023). https://dx.doi.org/10.1088/1361-651X/accfb5
21. Groen, N., Cranford, S., de Boer, J., Buehler, M. & Van Blitterswijk, C. Introducing materiomics. In High-Throughput Screening of Biomaterial Properties (Cambridge University Press, 2011). https://doi.org/10.1017/CBO9781139061414.002
22. Lee, N. A., Shen, S. C. & Buehler, M. J. An automated biomateriomics platform for sustainable programmable materials discovery. Matter 5, 3597–3613 (2022). https://dx.doi.org/10.1016/j.matt.2022.10.003
23. Arevalo, S. E. & Buehler, M. J. Learning from nature by leveraging integrative biomateriomics modeling toward adaptive and functional materials. MRS Bull. 1–14 https://link.springer.com/article/10.1557/s43577-023-00610-8 (2023).
24. Ghafarollahi, A. & Buehler, M. J. SciAgents: Automating scientific discovery through bioinspired multi-agent intelligent graph reasoning. Adv. Mater. https://doi.org/10.1002/adma.202413523 (2024).
25. Hong, J., Lee, N. & Thorne, J. ORPO: Monolithic preference optimization without reference model https://arxiv.org/abs/2403.07691 (2024).
26. Rafailov, R. et al. Direct preference optimization: Your language model is secretly a reward model https://arxiv.org/abs/2305.18290 (2024).
27. Ji, H. et al. Towards efficient exact optimization of language model alignment https://arxiv.org/abs/2402.00856 (2024).
28. Saeidi, A., Verma, S. & Baral, C. Insights into alignment: evaluating dpo and its variants across multiple tasks https://arxiv.org/abs/2404.14723 (2024).
29. Madaan, A. et al. Self-refine: iterative refinement with self-feedback https://arxiv.org/abs/2303.17651 (2023).
30. Shinn, N. et al. Reflexion: language agents with verbal reinforcement learning https://arxiv.org/abs/2303.11366 (2023).
31. Liu, F., AlDahoul, N., Eady, G., Zaki, Y. & Rahwan, T. Self-reflection makes large language models safer, less biased, and ideologically neutral https://arxiv.org/abs/2406.10400 (2025).
32. Zelikman, E., Wu, Y., Mu, J. & Goodman, N. D. Star: Bootstrapping reasoning with reasoning https://arxiv.org/abs/2203.14465 (2022).
33. Zelikman, E. et al. Quiet-star: language models can teach themselves to think before speaking https://arxiv.org/abs/2403.09629 (2024).
34. Brown, T. B. et al. Language models are few-shot learners. https://arxiv.org/abs/2005.14165 (2020).
35. run-llama/llama_index: LlamaIndex (formerly GPT Index) is a data framework for your LLM applications. https://github.com/run-llama/llama_index
36. Hesse, H. The Glass Bead Game (Holt, Rinehart and Winston, 1943).
37. Shen, S. C. et al. Robust myco-composites: a biocomposite platform for versatile hybrid-living materials. Mater. Horiz. 11, 1689–1703 (2024). https://dx.doi.org/10.1039/D3MH01277H
38. Ni, B. & Gao, H. A deep learning approach to the inverse problem of modulus identification in elasticity. MRS Bull. 1–7 https://www.cambridge.org/core/product/identifier/S0883769420002316/type/journal_article (2020).
39. Maurizi, M., Gao, C. & Berto, F. Inverse design of truss lattice materials with superior buckling resistance. npj Comput. Mater. 8, 1–12 (2022). https://dx.doi.org/10.1038/s41524-022-00938-w
40. Buehler, M. J. Accelerating scientific discovery with generative knowledge extraction, graph-based representation, and multimodal intelligent graph reasoning. Mach. Learn. Sci. Technol. 5, 035083 (2024).
41. Wu, Q. et al. AutoGen: enabling next-gen LLM applications via multi-agent conversation https://arxiv.org/abs/2308.08155v2 (2023).
42. Llama 3.2 model cards and prompt formats https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2 (2024).
43. Giesa, T., Spivak, D. & Buehler, M. Category theory based solution for the building block replacement problem in materials design. Adv. Eng. Mater. 14, 810–817 (2012).
44. Hu, E. J. et al. LoRA: low-rank adaptation of large language models https://arxiv.org/abs/2106.09685v2 (2021).
45. Ziolkowski, T. The Novels of Hermann Hesse: A Study in Theme and Structure (Princeton University Press, 1965).
© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”).