Abstract
Varietal improvement is a central goal of breeding, and as this work progresses, crop varietal data become more complex and costlier to extract. We therefore developed Chat-RGIE, a rice germplasm data extraction scheme based on conversational large language models (LLMs) and prompt engineering, which extracts rice germplasm data in a zero-shot manner. The scheme employs multi-response voting to reduce the chance of hallucinations, together with an additional verification component that selects the best extraction results. We conducted both a performance evaluation and a real-world data extraction evaluation of Chat-RGIE: the scheme achieved 0.9102 precision, 0.9941 recall, and 0.9554 accuracy in the performance evaluation, and 0.6351 precision, 1.0 recall, and 0.8225 accuracy in the real-world evaluation, demonstrating its effectiveness. The carefully designed extraction procedure also mitigates, to some extent, the risk that bias in a single large model leads to hallucinations; the hallucination rates in the two evaluations were 0.0015 and 0.005, respectively, a very minor influence. We further introduced the Restraint Rate, a statistic quantifying how strongly the prompt constrains LLM replies, which reached 0.9265 and 0.911 in the two evaluations, indicating largely normative responses. Finally, when examining the extraction results, we found that when confronted with an unanswerable question, the LLM is affected by the pressure exerted by the prompt: the higher the pressure, the more likely it is to engage in constraint-violating behavior, much as humans do under stress. We therefore believe that some of the countermeasures used against such human behavior may also help improve LLM performance.
Introduction
Rice is a critical staple crop, but the rate of increase in production of major crops, including rice, is insufficient to keep up with population growth [1, 2]. Population growth, dwindling freshwater supplies, and limited arable land pose severe threats to global food security. With natural resources limited, applying contemporary technology to explore rice yield potential and considerably improve rice output has become a high priority in rice breeding. The discovery and development of beneficial crop genes is a key step toward improving crop varieties. Many scientists have worked on the collection and integration of rice germplasm data and on the development of databases such as BGI-RIS [3] and the Rice Genetics and Genomics Database [4], with the goal of lowering the cost of acquiring breeding materials through the structured processing and standardized management of rice germplasm data, thereby accelerating breeding.
However, these databases are still insufficient to cover all breeding requirements. Breeders’ information requirements for rice germplasm resources fall into three levels: “What rice germplasm resources are available?”, “What rice resources do I need?”, and “How can I make good use of these rice resources?” To meet these three levels of need, three types of information services must be provided: a general list of rice germplasm resources, tailored search and screening of rice germplasm resources, and the foundation of prior research on the targeted rice germplasm resources. Existing studies have primarily provided lists and basic phenotypic information, which can only meet the first level of need. As a result, breeders using these platforms can get only a rough idea of which germplasm resources are currently being shared and utilized; the platforms cannot tell them what characteristics these resources have, whether they have been improved for breeding and gene mining, or how difficult they are to access and apply. Furthermore, as breeding work progresses and new rice is produced, the amount of rice germplasm data grows and the sources of data expand, including but not limited to books such as introduction records and variety journals, papers on germplasm resource research and application, various related databases, and news reports on the discovery and application of excellent germplasm resources. The difficulty and cost of accurately extracting and organizing rice germplasm data from these various sources is also increasing.
Existing databases are frequently constructed using natural language processing techniques and language models for automated data extraction to reduce creation costs [5–26]. Large language models [27–31], a relatively novel technique in the field of natural language processing, are thought to have more powerful data extraction capabilities than earlier techniques [32, 33], and there are related studies that use large language models to extract data [26]. Compared to classic extraction approaches with limited generalization capabilities, conversational large language models pre-trained on a general corpus can frequently provide standard replies on general knowledge. Conversational LLMs’ outstanding generalized linguistic competence enables high-quality data extraction with no upfront work and no additional training. LLMs perform well on generalized datasets with few or zero samples [34, 35]. Larger model or data volumes, more training computation, capability bootstrapping, alignment fine-tuning, and hardware improvements can all contribute to enhanced performance [36, 37]. The LLM’s capacity to generate conditional text [38] has been well validated. Studies using MMLU measurements have shown that Chinchilla [39] is nearly twice as accurate as the typical human in a given setting. In terms of entity recognition, an LLM can gather contextual information and then infer entity boundaries and types [40, 41]. In addition, LLMs can align with humans, communicate with external environments, operate tools, annotate data, and improve themselves [42, 43].
Although LLM has demonstrated excellent performance in generating human-like text, there are still some challenges to overcome if LLM is to be used for high-quality data extraction from rice germplasm resources, such as resolving language generation issues in order to adapt to the task scenario and minimizing the possibility of generating incorrect information due to LLM’s shortcomings.
LLMs are prone to issues in both controllable and specialized text generation. Controllable generation refers to the widely used method of leveraging plain-language instructions or prompts to make an LLM generate text under specific conditions. While this approach is convenient, imposing fine-grained or structural constraints on the model’s output presents considerable hurdles. Existing research [44] has demonstrated that when complicated structural constraints are imposed on the generated text, LLMs can handle local relationships well (e.g., interactions between neighboring sentences) but may struggle with global links (i.e., long-range correlations). When dealing with particular fields or tasks, LLMs may still encounter difficulties in generating text involving domain terminology and procedures. Domain knowledge is intuitively important for model specialization; however, incorporating this knowledge into an LLM is not easy. It has been demonstrated [45, 46] that when LLMs are trained to exhibit particular skills that allow them to flourish in one domain, they may suffer in another. This issue is connected to catastrophic forgetting in neural network training [47, 48], in which old and new knowledge conflict when combined. A similar scenario exists in the human-alignment fine-tuning of LLMs, where an “alignment tax” [49] must be paid in order to match the model with human values and requirements. Exploring safe and effective techniques to place constraints on LLM outputs, as well as building effective model specialization approaches, is required to adapt LLMs to the rice germplasm data extraction scenario while preserving their original capabilities to the greatest extent possible.
Furthermore, LLM shortcomings such as hallucination [35], bias, the real-time nature of knowledge, numerical computation, and inconsistency can produce misinformation and reduce the effectiveness of data extraction with LLMs.
Hallucination [50] is defined as the generation of content that contradicts, or cannot be validated against, existing sources. Hallucinations are common in existing LLMs, and even the strongest, such as GPT-4 [35], are susceptible to them. Essentially, LLMs appear to apply knowledge “unconsciously” in problem solving, lacking the ability to accurately control the use of internal or external knowledge. Hallucinations can mislead LLMs into producing undesired outputs and, in most cases, diminish their performance, posing a risk when deploying LLMs in real-world applications. LLMs may also generate text containing sensitive information or offensive expressions [35]; while the RLHF algorithm [49] can mitigate this problem to some extent, it still requires a significant amount of manually labeled data to fine-tune LLMs and lacks an objective optimization goal. The challenge posed by the real-time nature of knowledge is that an LLM encounters difficulties with tasks requiring knowledge more recent than its training data, for which the LLM must be regularly updated with new data, whereas fine-tuning the LLM is too expensive and is likely to result in catastrophic forgetting [47]. As breeding efforts continue and new rice is created, real-time information must be considered to satisfy the needs of the rice germplasm resource field, which continually generates new data. Even if the LLM is simply fine-tuned periodically, the long-term costs cannot be overlooked. As for numerical computation, LLMs still face challenges in sophisticated inference tasks, particularly with symbols that are infrequently encountered in the pre-training phase, such as arithmetic operations on large numbers [51, 52].
Similarly, rice germplasm data commonly use technical terms and corresponding abbreviations such as EDTA (ethylenediaminetetraacetic acid), PCR (polymerase chain reaction), RNase A (ribonuclease A), TE (Tris-EDTA buffer), and Tris (tris(hydroxymethyl)aminomethane); these symbols may rarely appear during the pre-training phase of an LLM, which is likely to leave the LLM unable to understand their true meaning. Inconsistency refers to the fact that an LLM may generate the correct answer despite an incorrect reasoning path, or produce an incorrect answer after a correct reasoning process [53, 54], resulting in a mismatch between the obtained answer and the reasoning process. In addition, inconsistency may also exist between tasks with similar inputs, i.e., small changes in the task description may lead to different results from the model [51, 55]. To ensure the quality of rice germplasm resource data extraction, uniform and stable task descriptions should be used whenever possible.
These flaws cannot be overlooked given the data quality required to build a rice germplasm database. Introducing expertise into an LLM through secondary training can solve some of the challenges of extracting rice germplasm resource data with an LLM, such as specialized generation, real-time knowledge, and numerical computation. However, LLM training is expensive and involves a large quantity of data, technical knowledge, specialized hardware, suitable storage, and data security measures [56]; prompt engineering is therefore a more cost-effective way to adapt an LLM. When adopting this strategy, the quality of the prompts directly correlates with the quality of the responses [57]. Well-designed prompts can increase the accuracy of extracting structured information from an LLM, and prompt design has a long-term impact on prompt expansion and maintenance. Researchers have investigated prompt design tactics to increase the likelihood that LLMs generate the intended results [58]. Prompt engineering has proven useful in boosting model performance. Prompt engineering approaches such as Chain of Thought (CoT) [53], Tree of Thoughts (ToT) [59], Graph of Thoughts (GoT) [60], and ReAct (Reason and Act) [61] can dramatically increase LLM reasoning capacity and task performance. Furthermore, strategies like Self-Consistency [62] and Few-Shot Prompting [63] can help LLMs perform better and more consistently.
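To make the Self-Consistency idea concrete, the sketch below shows simple majority voting over repeated samples of the same prompt. This is a minimal illustration under stated assumptions, not the original implementation: `ask_llm` stands in for any function that queries an LLM, and the canned replies are a toy substitute for a real stochastic model.

```python
from collections import Counter

def self_consistency(ask_llm, prompt, n_samples=5):
    """Sample several replies to the same prompt and keep the majority
    answer; `ask_llm` stands in for any function that sends a prompt
    to an LLM and returns its text reply."""
    answers = [ask_llm(prompt) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples  # majority answer and agreement ratio

# Toy stand-in for a stochastic LLM: it simply replays canned replies.
canned = iter(["IR64", "IR64", "IR-64", "IR64", "Not in Detail"])
answer, agreement = self_consistency(lambda p: next(canned), "Extract the variety name.")
```

Here the majority answer survives one formatting variant and one null reply, which is exactly why voting stabilizes extraction output.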
As stated above, existing work on rice germplasm resource mining and utilization can only provide preliminary information on the overall list of rice germplasm resources, not the targeted search and screening of resources or the existing research bases of specific resources. Extracting and exploiting this information is critical for meeting breeder needs and for further exploring and utilizing rice germplasm resources. However, because this information remains underutilized, there is a scarcity of datasets meeting the quantitative and qualitative requirements for model training, which prevents high-quality data extraction regardless of whether traditional methods (rule-based, classical machine learning, and deep learning) that rely on manually annotated data are trained, or LLM secondary-training methods are employed. To obtain high-quality extraction results under these approaches, significant labor and time would have to be invested in reading a large body of literature and manually extracting data to generate the required training dataset. Furthermore, to keep knowledge current, the dataset would have to be reassembled and the model re-trained each time the knowledge is updated, increasing costs indefinitely, and secondary training may cause catastrophic forgetting.
Therefore, to meet breeders’ demand for further information on rice germplasm resources in the absence of labeled data, we propose Chat-RGIE, which combines LLMs and prompt engineering to achieve high-quality extraction of rice germplasm data without training on labeled data, and can help researchers rapidly extract rice germplasm information from research paper abstracts, enabling further construction of relevant databases. Chat-RGIE uses LLMs, which outperform standard approaches and can perform data extraction tasks with high quality. To reduce knowledge bias as far as feasible, Chat-RGIE employs multiple LLMs in the data extraction process and leverages prompt engineering to adapt the LLMs to the downstream requirements. On the one hand, this avoids the high cost of secondary training of multiple LLMs and the catastrophic forgetting it may cause; on the other hand, prompt engineering can be updated and maintained more cost-effectively, which better suits the rice germplasm resource field, where new information is continually created. Inside Chat-RGIE, we carefully constructed a set of prompts for rice germplasm data extraction to stabilize and standardize the prompting, reducing the impact of inconsistency. Furthermore, because of its enormous number of parameters, a large model requires significant memory resources during the decoding stage, making local deployment prohibitively expensive in real applications. To avoid the computational, data, and technical issues that might arise when utilizing LLMs, Chat-RGIE accesses LLMs via API interfaces, which can be used without understanding the internal workings of LLMs and also allows Chat-RGIE to swap its internal LLMs in a very simple way.
The rest of this paper is organized as follows. Section 2 introduces the related work. Section 3 describes the workflow and details of the proposed Chat-RGIE in detail. Section 4 includes the performance evaluation and real-life data extraction evaluation of Chat-RGIE. The conclusion is presented in Sect. 5.
Related work
Traditional methods
Traditional methods can be categorized into rule-based methods, classical machine learning-based methods, and deep learning-based methods.
Rule-based methods rely on explicit rules and algorithms formulated by experts to achieve knowledge extraction; examples include the LaSIE-II [64] and NetOwl [65] systems. They have certain drawbacks: first, it is impossible to formulate rules that cover all data, and the rules are poorly portable and difficult to update; second, this approach places extreme demands on experts. It is therefore not applicable to the structurally diverse writing styles of rice germplasm resource application information.
Classical machine learning approaches use large amounts of labeled data to train and evaluate models for knowledge extraction, such as Hidden Markov Models [66], Maximum Entropy Models [67], Support Vector Machines [68], and Conditional Random Fields [69], among others. Sampaio et al. [70] employed standard machine learning approaches such as Artificial Neural Networks (ANN) and Multiple Linear Regression (MLR) to predict rice quality features from physical metrics (e.g., appearance, milling yield). Tiozon et al. [71] not only performed metabolomic analysis of rice germplasm resources but also used a Random Forest (RF) model to extract and predict rice nutritional properties. Unlike the rule-based approach, this method replaces reliance on expert-developed rules with labeled data; however, because of its dependence on manually labeled data, its high data requirements, and its limited accuracy, it is not applicable to rice germplasm resource data that lacks labeled data.
Deep learning-based techniques primarily use various deep neural network models. This line of work has evolved rapidly in recent years, with models such as BiLSTM-CRF [72], CNN-CRF [73], and BERT [74] emerging [75, 76]. Kumar et al. [77] employed DLNet (a deep learning-based rice network model), which integrates expression profiling data and network data to predict rice defense responses to pathogens. Vourlaki et al. [78] investigated the performance of the Multilayer Perceptron (MLP), Convolutional Neural Network (CNN), and other deep learning models in predicting rice traits and compared them to a Bayesian linear model; the findings indicate that deep learning models outperformed Bayesian models in 75% of the studied instances and consistently enhanced predictive ability when predicting binary attributes. Compared to the classical machine learning-based technique, this approach reduces reliance on manually labeled corpora and achieves higher accuracy, but it still has flaws. For example, CNN-CRF, which processes text with a fixed-size convolutional kernel, suffers from the long-dependency and context fragmentation problems, limiting its ability to handle discontinuous and nested entity relationships and making it inapplicable to the long texts of rice germplasm resource application information. Although the BERT model, built on the Transformer architecture and using the attention mechanism for sequence modeling, can learn longer-distance dependencies, basic BERT still cannot process text longer than 512 tokens and has limited comprehension in application scenarios where labeled data is scarce.
Therefore, because of the lack of sufficient annotated data about which characteristics of rice germplasm resources are identified by which methods, whether breeding improvement and gene mining have been carried out, and other related information extracted from text, existing methods cannot train an ideal rice germplasm resource data extraction model, and existing deep learning models struggle with longer rice germplasm texts. Moreover, even if suitable rules or models could be constructed, it would be difficult to extend them to new data as rice germplasm data grows, and substantial cost would be incurred to formulate rules or train models again.
LLM-related methods
Since the release of GPT-3 [79] by OpenAI, large language models have attracted much attention in the field of natural language processing. Trained on large amounts of general-purpose text data, these models have shown great potential for acquiring human-like intelligence [27, 31, 35, 80, 81] and can perform natural language tasks such as text generation with high quality. Today, LLMs (e.g., GPT-4 [35] and LLaMA [81]) show excellent performance on tasks in different domains without fine-tuning; for example, the GPT family of models introduces unprecedented capabilities in text summarization and classification [82, 83]. However, LLMs are not perfect and face new challenges regarding sensitivity and potential bias [84, 85]. When pre-training LLMs, biases in the training data may be perpetuated and amplified when the models are used, occasionally causing LLMs to perform very poorly. This is a common problem that may be very difficult to detect and correct at a later stage [86, 87]. Moreover, in specialized domains, an LLM that has not been trained on the relevant data is highly likely to fabricate false facts and answer questions accordingly, a situation known as “hallucination.” When dealing with specialized fields or tasks, LLMs may also face challenges in generating text involving terminology and methods. Therefore, despite the generalization ability of LLMs, Hajikhani et al. [88] argue that specialized models should be used for tasks requiring precision and accuracy.
Existing technologies allow us to transform LLMs into task-specific models in a special way, under the premise that data extraction using LLMs is feasible and has great potential for automated or semi-automated data extraction [89]. Many researchers have conducted studies related to structured data extraction using large language models and have made some progress. Most of the methods used in existing studies can be categorized into two types: one is secondary training of LLMs [90, 91], and the other is designing prompt words to achieve data extraction.
There are two main approaches to adapting pre-trained LLMs: instruction tuning [92], which seeks to improve (or unlock) LLM capabilities, and alignment tuning, which seeks to match LLM behavior with human values or preferences. To perform instruction tuning, one first gathers or generates instances in instruction format and then uses them to fine-tune the LLM in a supervised manner (e.g., trained with a sequence-to-sequence loss). After instruction fine-tuning, the LLM demonstrates improved ability to generalize to previously unseen tasks [92–94]. Alignment tuning has been proposed to prevent unwanted behaviors such as fabricating false information, pursuing incorrect goals, and producing harmful, misleading, and biased content [28, 95]. Alignment tuning differs from instruction tuning in that it takes into account factors such as helpfulness, honesty, and harmlessness. It has been demonstrated that alignment tuning can reduce an LLM’s generalization capacity to some extent, a phenomenon known as the alignment tax in related studies [28, 96, 97]. Because LLMs contain a very large number of model parameters, full-parameter fine-tuning incurs significant overhead, so related work applies parameter-efficient fine-tuning approaches for secondary training of LLMs. LoRA [98] is commonly used in open-source LLMs to achieve parameter-efficient fine-tuning. However, studies [99] have evaluated serial adapter fine-tuning [100], parallel adapter fine-tuning [101, 102], and LoRA fine-tuning on three open-source LLMs, and the results show that while these efficient fine-tuning methods are comparable in performance on simple tasks, they do not perform as well on difficult tasks as the reference benchmark model, GPT-3.5. For the difficult texts of rice germplasm resources, parameter-efficient fine-tuning is not an optimal technique.
Because comprehensive fine-tuning of an LLM is expensive and parameter-efficient fine-tuning does not raise LLM performance on challenging tasks, natural language-based prompting (also known as prompt engineering) provides a more efficient way to address the rice germplasm resource extraction challenge with LLMs. Prompt engineering is a form of prompt tuning comprising the process of crafting optimal prompts to achieve the best results. Prompts are supplementary information that the user supplies to set conditions for interacting with the LLM and receiving a model answer; this additional information often consists of questions, instructions, and examples used as task input markers. Prompt-based techniques are a promising alternative because learning via prompts becomes more efficient and cost-effective as LLMs scale. In addition, unlike secondary training, which requires a different model for each downstream task, prompt tuning allows a single model to serve numerous downstream tasks, and multitask prompts can help the model generalize across tasks and complete cross-domain activities. Prompts must be designed to best elicit knowledge and maximize the predictive performance of the language model. When using this approach, the quality of the prompts is directly related to the quality of the responses obtained [103]: a well-crafted prompt can lead to an accurate, high-quality response that maximizes the model’s performance, whereas a poorly structured question may lead to a vague and irrelevant response. Researchers have explored prompt design strategies with the aim of making LLMs more likely to produce the desired results [58].
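As an illustration of such conditioning, a minimal extraction prompt might combine a role, an instruction, and an output constraint in one stable template. The wording, field names, and null marker below are assumptions for the sketch, not the actual prompts of any cited study.

```python
# Illustrative prompt template; the role text, rules, and field name
# are assumptions, not the exact prompts used by any cited work.
TEMPLATE = (
    "You are an assistant for rice germplasm data extraction.\n"
    "From the abstract below, extract the value of the field '{field}'.\n"
    "Rules: reply with the value only; if the abstract does not state it, "
    "reply exactly 'Not in Detail'.\n"
    "Abstract: {abstract}"
)

def build_prompt(field, abstract):
    """Fill the template so every request uses a uniform, stable task
    description, which helps reduce inconsistency between runs."""
    return TEMPLATE.format(field=field, abstract=abstract)

prompt = build_prompt("variety name", "We evaluated blast resistance of ...")
```

Keeping the template fixed and interpolating only the field and abstract is one simple way to realize the "uniform and stable task descriptions" recommended earlier.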
Some studies have looked into the use of LLMs and prompt engineering in agriculture, demonstrating that LLMs can perform specific agricultural tasks and making some progress, although significant limitations of LLMs remain. Zhao et al. [104] investigated the potential of ChatGPT for agricultural text categorization and proposed ChatAgri, a ChatGPT-based text classification framework that uses the LLM via an API interface and successfully avoids the complex and costly problem of local LLM deployment. Peng et al. [105] employed LLMs to extract structured data from agricultural documents; although the findings suggest that LLMs have significant potential for agricultural information extraction tasks, the trials revealed several limitations: when using an LLM to translate descriptive language into attributes, the resulting attributes can be slightly skewed (for example, “dark spots” may be identified as “color” or “pattern”). This recognition bias can cause swings in accuracy, and the output when matching values and attributes is insufficient to establish whether such small differences will cause problems in large-scale testing or real-world applications. Jiajun Q et al. [106] proposed a method for diagnosing agricultural pests and diseases that combines GPT-4 and YOLOPC and achieved a higher accuracy rate; even so, the problems of bias and error propagation in LLMs could not be avoided. It has thus been demonstrated that LLM bias causes oscillations in output accuracy when LLMs are used for agricultural tasks.
In conclusion, whereas LLMs excel at general-purpose natural language processing tasks, more specialized models should be used for specialized domains or tasks. LLMs can now be adapted to specialized domains or tasks through secondary training or by combining them with prompt engineering; the latter is clearly a more cost-effective way of adapting to multiple downstream tasks than costly secondary training, which struggles to serve multiple downstream tasks at once. Meanwhile, research combining LLMs with prompt engineering for agricultural tasks has demonstrated that this technique is effective in the agricultural domain, while also establishing that LLM bias can cause certain swings in accuracy. Chat-RGIE therefore builds on this existing experience: to minimize the impact of bias on accuracy, multiple LLMs participate in the data extraction process simultaneously, and the attribute names and output format are strictly specified in the prompt; at the same time, to prevent the participation of multiple LLMs from driving up the cost of LLM specialization, the more cost-effective prompt engineering is chosen to carry out that specialization.
Data extraction workflow
Overall process
The workflow of Chat-RGIE is shown in Fig. 1. The primary extraction material for Chat-RGIE is rice germplasm research paper abstracts, and the preliminary work consists of gathering research papers, exporting the title data, and cleaning it to guarantee that only the paper abstracts remain. Previous research employing a single LLM to handle problems in the agricultural domain has shown that the LLM’s output exhibits a certain bias when the descriptive text changes, resulting in oscillations in accuracy [105].
[See PDF for image]
Fig. 1
Schematic diagram of the overall Chat-RGIE process. For each abstract, data extraction is performed by n LLMs, and the n+1th LLM, which is not involved in data extraction, compares all the extraction results to achieve validation
In the data extraction phase, Chat-RGIE first feeds the extracted abstracts into n conversational LLMs (the value of n is not fixed). In each LLM, the abstract is combined with the pre-designed data extraction prompts, and the raw extraction results are obtained from the LLM; this unit is called the RGIE block. In the RGIE block, we explicitly allow null values: when the value corresponding to a specified attribute cannot be recognized, the specified string “Not in Detail” is output, and when an abnormal value occurs during the extraction of an attribute, a specified string, depending on the attribute, is requested as output. In addition, although the number of LLMs used, n, is not fixed, we expect all the LLMs used to jointly meet the capacity expectation in general; a simple example is shown in Fig. 2.
[See PDF for image]
Fig. 2
The sum of the capacities of all LLMs shall meet all performance requirements (example)
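The fan-out of one abstract to n LLMs, with the null-value handling described above, can be sketched roughly as follows. This is an illustrative sketch, not the actual Chat-RGIE code: `llm_clients` is a placeholder for real API client wrappers, and mapping failures to the null marker is an assumption of the sketch.

```python
def rgie_block(abstract, extraction_prompt, llm_clients):
    """Send the same abstract and extraction prompt to every LLM and
    collect the raw results, mapping failures and empty replies to the
    null marker 'Not in Detail'. `llm_clients` is a list of callables
    wrapping API calls (placeholders, not real client code)."""
    full_prompt = f"{extraction_prompt}\n\nAbstract:\n{abstract}"
    results = []
    for ask in llm_clients:
        try:
            reply = ask(full_prompt).strip()
        except Exception:
            reply = "Not in Detail"  # an API failure is treated as a null value
        results.append(reply or "Not in Detail")
    return results
```

Because each LLM sits behind a uniform callable, swapping the internal models amounts to changing the `llm_clients` list, which matches the API-based design motivation given earlier.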
When the data extraction results of all n LLMs have been obtained, the abstract and the raw extraction results output by each LLM are combined with the pre-designed answer-scoring prompts and input into an n+1th LLM (which is not involved in the data extraction work) in order to select the optimal extraction results; this unit is referred to as the Verify block. Chat-RGIE is thus made up of n RGIE blocks and one Verify block, with the RGIE blocks carrying out data extraction from research paper abstracts and the Verify block reviewing the extraction results obtained from them. The RGIE block and the Verify block are the critical components of Chat-RGIE, and their details are discussed in Subsection 3.2.
Essential part
RGIE block
Figure 3 illustrates the flow of the RGIE block, which is a series of prompts used to instruct the large language model to extract the corresponding data fields, realizing data extraction for two parts of the rice germplasm: (1) rice germplasm source, which extracts the variety name, parental source, and related description of the rice germplasm with a small number of short prompts; and (2) rice germplasm traits. One or more types of traits may be involved in a research paper, and the RGIE block focuses on extracting some of the key traits of rice, including agronomic, quality, yield, and resistance traits. When the RGIE block targets the extraction of rice germplasm sources, the information is extracted directly from the text: the values corresponding to attributes such as the rice’s name, resource name, and other names are recognized and extracted respectively, and the data are returned as single values. Additionally, for different keywords we provide specific guidelines within the prompt. For example, common fields such as names are directly required to be extracted, while fields that can be graded, such as resistance levels, have their grades enumerated within the prompt (e.g., 1. Not in detail; 2. Immunity; 3. High resistance; ...). See Appendix B for examples of the specific prompts.
[See PDF for image]
Fig. 3
RGIE block flow diagram
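The two prompt styles in the RGIE block (direct extraction for plain fields, a graded scale for gradable fields) can be sketched as follows. The prompt texts here are paraphrases for illustration; the real wording is in Appendix B.

```python
# Illustrative RGIE field prompts: plain fields are extracted directly,
# gradable fields carry an answer scale inside the prompt.
FIELD_PROMPTS = {
    "variety_name": "Extract the rice variety name. Answer with a single value.",
    "parental_source": "Extract the parental source of the variety. Answer with a single value.",
    "blast_resistance": (
        "Grade the blast resistance described in the text:\n"
        "1. Not in Detail; 2. Immunity; 3. High resistance; 4. Resistance.\n"
        "Answer with the number only."
    ),
}

def build_prompt(field, abstract):
    """Combine the abstract with the field-specific instruction for one query."""
    return f"Text:\n{abstract}\n\nQuestion: {FIELD_PROMPTS[field]}"

p = build_prompt("variety_name", "IR64 was derived from IR5657/IR2061.")
```

Each field produces one short, independent question, which keeps every answer a single value or key-value pair and thus machine-parseable.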
Verify block
The Verify block’s flow diagram is shown in Fig. 4. To compare the data extraction results of the n LLMs, an (n+1)th LLM is introduced to examine them; the purpose of the Verify block is to further limit the potential for errors. A set of prompts is used in the Verify block to grade each LLM’s extraction results and select the best one. Assuming that all responses can be categorized into two groups (positive and negative responses), we first ask LLMn+1 to compute the Euclidean distance between all of the responses in order to ascertain whether the extraction results of the LLMs are consistent. If the majority of the LLMs’ responses are quite similar, that group’s responses are classified as positive; otherwise, all responses are classified as negative. Second, since the positive responses are similar to one another, they are merged into a single response, which reduces the number of candidates to grade while maintaining the accuracy of the results. In the third stage, the merged positive response and all negative responses are graded: the score is determined by comparing the extracted data against the original abstract and the question, and LLMn+1 assigns a value from 0 to 100 depending on the match. In particular, a response that matches the original text but not the question scores zero, as does a response that matches neither the question nor the original abstract text.
If the highest-scoring answer is unique, it is output directly; otherwise, LLMn+1 is asked to determine again which of the tied answers most closely matches the question and the original abstract text before outputting the most suitable response. Furthermore, because the results supplied by the RGIE block may contain nulls and outliers, we added a small amount of text processing in the Verify block to prevent problems.
[See PDF for image]
Fig. 4
Verify block flow diagram
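The grouping step of the Verify block can be sketched with a simple similarity function. This is a stand-in reconstruction: the paper has LLMn+1 compute the distance between responses, whereas this sketch uses word-set (Jaccard) overlap, and the threshold and majority rule are assumptions.

```python
# Sketch of the Verify-block voting logic: answers that are mutually similar
# form the "positive" group if they hold a majority; the rest are "negative".
def similarity(a, b):
    """Jaccard similarity over word sets (a stand-in for the paper's distance)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def split_positive_negative(answers, threshold=0.6):
    """Greedily cluster answers; a majority cluster becomes the positive group."""
    groups = []
    for ans in answers:
        for g in groups:
            if similarity(ans, g[0]) >= threshold:
                g.append(ans)
                break
        else:
            groups.append([ans])
    largest = max(groups, key=len)
    if len(largest) > len(answers) // 2:
        positives = largest
        negatives = [a for g in groups if g is not largest for a in g]
    else:
        positives, negatives = [], answers
    return positives, negatives

answers = ["IR64", "IR64", "IR64", "Not in Detail", "IR64"]
pos, neg = split_positive_negative(answers)
```

The positive group would then be merged into a single answer and scored 0–100 against the abstract and the question, alongside each negative answer.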
Additional details
We require all responses to restate content from the original description, but this alone does not exclude the possibility of fabricated facts. To further constrain LLM answers, we included some extra cues, such as explicitly stating that the original text may not contain the answer to some queries and requesting that a specific character be returned when no answer can be recognized. It is worth noting that, to automate the data extraction procedure, all answers must be supplied as individual words or key-value combinations, which simplifies later data processing.
At the same time, to ensure that the model’s attention to textual details does not dwindle as the conversation progresses, we always provide the text with each data extraction request, which improves the quality of the answers [26] and strengthens the likelihood of short, structured responses. It is worth noting that in our earlier attempts we used very strong statements to require the LLMs to respond in the specified way. Although there was no requirement to provide a non-empty answer, this led to redundant explanations when an LLM was unable to extract the specified value from the text, or to answers that matched the content but were irrelevant to the question, both of which degrade the normativity and precision of the answers. Although it is unclear exactly how strong a demand or prohibition must be to cause this difficulty, we recommend avoiding excessively strong requirements or prohibitions in prompts to ensure accuracy and consistency of responses.
To conclude, the prompt that we repeatedly deliver in each round of dialog contains: (1) the text to be extracted; (2) acceptance of, and mild encouragement for, “Not in Detail” responses; and (3) a clear response template that must be followed (a non-emotional declarative statement).
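The three components repeated in every dialog round can be assembled as below. The wording is a paraphrase of the paper’s prompt, not its exact text.

```python
# The three components delivered in every dialog round, assembled into one
# prompt (illustrative wording; the paper's exact prompts are in Appendix B).
def round_prompt(abstract, question):
    return (
        f"Text to be extracted:\n{abstract}\n\n"            # (1) always resend the text
        f"{question}\n"
        "If the text does not contain the answer, reply exactly "
        "'Not in Detail'.\n"                                # (2) allow a negative answer
        "Respond only in the form: key: value"              # (3) response template
    )

msg = round_prompt("IR64 matures in 115 days.", "Extract the growth duration.")
```

Resending the full text each round keeps the model anchored to the source, while the template and the explicit “Not in Detail” allowance discourage free-form explanations.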
Obviously, we have not addressed all of the rice germplasm phenotyping keywords, and further optimization is possible. The data extraction prompts in Chat-RGIE are based on as broad a set of keywords as feasible, and we regard the prompts offered here as a freely modifiable component applicable to many data extraction jobs. The processes outlined in the Verify block, including the calculation of response similarity and the scoring criteria, can likewise be tailored to specific jobs.
Experiment
Dataset
In this study, we searched Web of Science using the core keywords “rice germplasm”, “Oryza sativa”, “rice varieties”, “germplasm”, “varieties”, “genetics”, and so on, and took the top 1000 research papers ordered in descending order of relevance. We created criteria to extract text containing only the abstracts, from which we built the experimental dataset for this investigation. See Appendix A for the specific articles used.
Experiment settings
This section explains the fundamental setup used in our research. Chat-RGIE, as a flexible data extraction scheme, has no internal data values that must be explicitly provided, and there is no limit on the number of LLMs that can be utilized in the RGIE and Verify blocks. In general, however, we suggest using only one LLM in the Verify block, and it should preferably outperform the LLMs used in the RGIE block in some respects. Moreover, since we use multiple LLMs in the RGIE block to reduce, through the comparison of answers, the possibility of hallucinations affecting answer quality, we believe that the LLMs used in an RGIE block should be of different kinds; it is not recommended to use models from the same series (e.g., ChatGLM2 and ChatGLM3 should not be used at the same time in an RGIE block). In this experiment, we used five LLMs in the RGIE block and one LLM in the Verify block.
The scores of the LLMs we used in the RGIE block and Verify block under the selected benchmarks are shown in Table 1. We analyzed current LLMs in order to select those used in this experiment. Since our main research material is abstracts of research papers, which have long text lengths, our selection criteria required, in addition to performance that is as good as possible, support for long text input. Taking the requirements of the rice germplasm data extraction task as a starting point, we want the LLMs we use to have strong language comprehension, logic, and reasoning ability, so we mainly focused on the models’ scores on MMLU [107], HellaSwag [108], BBH [109], GSM8K [110], and MATH [111]. We compared the scores of a subset of current mainstream LLMs on these benchmarks, taken mainly from LLM technical reports or official websites, with a few from InternLM’s OpenCompass online evaluation results. We chose Qwen2-72B-instruct, which has the better composite score, for the scoring work in the Verify block, and five LLMs with some differences in capability for the RGIE block, namely Command-r-plus-104b, Qwen1.5-110B [112], Yi-34B [113], LLaMA3-70B [114], and Deepseek-v2-236B [115] (Table 2). Of these, Command-r-plus-104b lacks plausible scores on the BBH and MATH benchmarks, but since we want some variability among the LLMs used, we chose to include it in the RGIE block even in the absence of some scores.
Table 1. Experimental LLM selection
| Model | MMLU | HellaSwag | BBH | GSM8K | MATH | Context |
|---|---|---|---|---|---|---|
| Command-r-plus-104b | 75.7 | 88.6 | – | 70.7 | – | 128k |
| Qwen1.5-110B | 80.4 | 87.5 | 74.8 | 85.4 | 49.6 | 32k |
| Yi-34B | 76.3 | 87.19 | 54.3 | 67.2 | 14.4 | 200k |
| LLaMA3-70B | 79.5 | 88.0 | 76.6 | 79.2 | 41.0 | 8k |
| Deepseek-v2-236B | 78.5 | 84.2 | 78.9 | 79.2 | 43.6 | 128k |
| Qwen2-72B-instruct | 82.3 | 87.6 | 82.4 | 91.1 | 59.7 | 128k |
Table 2. LLMs and temperature settings used in each stage

| Stage | Large language model | Temperature |
|---|---|---|
| RGIE block | Command-r-plus-104b | 0.7 |
| RGIE block | Qwen1.5-110B | 0.7 |
| RGIE block | Yi-34B | 0.7 |
| RGIE block | LLaMA3-70B | 0.7 |
| RGIE block | Deepseek-v2-236B | 0.7 |
| Verify block | Qwen2-72B-instruct | 0.7 |
The experimental configuration of this paper is shown in Table 3. The experiment was conducted by deploying the LLMs locally and calling them through an API interface; the framework used is ollama (https://github.com/jmorganca/ollama). If you instead use the API provided by a commercial LLM service, the hardware only needs to meet that provider’s recommended configuration; the hardware used in this experiment is not a mandatory requirement for using Chat-RGIE.
Table 3. Experimental configuration

| Hardware | Configuration |
|---|---|
| CPU | Intel Xeon ×2 |
| Chipset | Intel C741 |
| GPU | NVIDIA RTX A6000 48G ×4 |
| Total memory | 256GB |
| Storage capacity | 2TB + 8TB |
| Framework | Ollama |
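A locally deployed ollama server exposes a chat endpoint at `http://localhost:11434/api/chat`. The sketch below only builds the JSON payloads one would POST there (nothing is sent), and the ollama model tags are illustrative assumptions rather than the exact tags used in the experiment.

```python
import json

# Build request payloads for ollama's /api/chat endpoint; the model tags
# below are placeholders corresponding loosely to the models in Table 2.
RGIE_MODELS = ["command-r-plus", "qwen:110b", "yi:34b", "llama3:70b", "deepseek-v2:236b"]

def chat_payload(model, prompt, temperature=0.7):
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,                       # one complete reply per request
        "options": {"temperature": temperature},
    }

payloads = [chat_payload(m, "Extract the rice variety name from the text ...")
            for m in RGIE_MODELS]
body = json.dumps(payloads[0])                 # what would be POSTed
```

Because the scheme is API-agnostic, swapping ollama for a commercial endpoint only changes the URL and payload shape, not the Chat-RGIE logic.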
Metric
As indicated in Sect. 1, for an LLM to produce good data extraction outcomes, we must address the issues of controlled generation, specialized generation, bias, and hallucination, in addition to LLM specialization. To clarify Chat-RGIE’s rice germplasm data extraction ability, we employ Precision and Recall, the assessment metrics typically used in data extraction tasks. For language generation, we define the Restraint rate as the degree to which Chat-RGIE constrains LLM output to be fine-grained and structured. For possible bias and hallucination, we follow existing studies [116] and introduce Accuracy and the Hallucination rate to evaluate the model’s ability to provide correct answers.
In our evaluation, we defined True Positive, True Negative, False Positive, and False Negative per input abstract. For each extracted keyword, if the abstract was judged manually to contain no relevant content and RGIE also extracted no data, the case was defined as True Negative; every item extracted from an abstract that does not contain relevant data is counted as a False Positive. If one key-value pair was extracted manually from the abstract and the data extracted by RGIE contain more than one pair, they are compared with the manually extracted pair in turn: a case is defined as True Positive if the two are equivalent and False Positive if they are not. Once a manually extracted item has been confirmed as equivalent to a particular piece of data extracted by RGIE, any subsequent pair that also agrees with it is judged False Positive. When the data extracted by Chat-RGIE contain more than one value, the use of synonyms is permitted, but all values must correspond (e.g., if five types of phenotypic data were extracted manually, all five types must be recognized by RGIE as well).
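The judging rules above can be sketched as a small scoring routine. This is an illustrative reconstruction, not the authors’ evaluation code; the `judge` helper and its match-by-removal logic (each manual item can be confirmed only once, so duplicates become False Positives) are assumptions based on the description.

```python
# Per-abstract, per-keyword judging: TP/FP/FN/TN under the rules described above.
def judge(manual, extracted):
    """Return (tp, fp, fn, tn) for one keyword on one abstract."""
    if not manual and not extracted:
        return 0, 0, 0, 1                # nothing to find, nothing extracted: TN
    if not manual:
        return 0, len(extracted), 0, 0   # every extraction is a False Positive
    tp = fp = 0
    remaining = list(manual)
    for item in extracted:
        if item in remaining:
            tp += 1
            remaining.remove(item)       # a manual item may be confirmed only once
        else:
            fp += 1                      # duplicates and mismatches count as FP
    return tp, fp, len(remaining), 0     # unmatched manual items are FN

def precision_recall(counts):
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return tp / (tp + fp), tp / (tp + fn)

counts = [judge(["IR64"], ["IR64", "IR64"]),   # duplicate -> 1 TP, 1 FP
          judge([], []),                       # 1 TN
          judge(["Koshihikari"], [])]          # missed item -> 1 FN
p, r = precision_recall(counts)
```

With the toy cases above, both precision and recall come out to 0.5, illustrating how duplicates and misses are penalized separately.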
Precision [117] is defined as the proportion of results predicted to be Positive that are truly Positive:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{1}$$

Recall [117] is defined as the proportion of all True Positives that are correctly predicted as Positive:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{2}$$

Accuracy [116] is defined as the proportion of predicted answers that have any overlapping words with the true answer:

$$\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{overlap}(\hat{y}_i, y_i) \tag{3}$$

The overlap function checks whether there is a common word between the predicted answer and the correct answer; $\hat{y}_i$ is the predicted answer, $y_i$ is the true answer, and $N$ is the number of samples.

Hallucination [116] is defined as the percentage of hallucinated answers out of all answers. Hallucination poses a significant challenge to the accuracy and reliability of data extraction, especially for systems that use generative models. Although it is still necessary to manually review and calibrate the output when using Chat-RGIE for rice germplasm data extraction, our goal is to minimize the time and labor costs involved and to automate the extraction, so we would like the Hallucination rate to be as low as possible in practical applications:

$$\mathrm{Hallucination} = \frac{N_{\mathrm{hallu}}}{N} \tag{4}$$

where $N_{\mathrm{hallu}}$ is the number of hallucinated answers and $N$ is the number of samples. A hallucinated answer is one that generates information conflicting with existing sources (an intrinsic hallucination) or that cannot be verified against existing sources (an extrinsic hallucination).

The Restraint rate is defined as the percentage of all responses that conform to the norms we designed. In this case, a response is considered to contain a sentence if it has a “subject-verb-object” structure:

$$\mathrm{Restraint\ rate} = \frac{N - N_{\mathrm{nc}}}{N} \tag{5}$$

where $N_{\mathrm{nc}}$ is the number of non-canonical answers, i.e., answers not presented in the format we have specified.

Result
Results of performance evaluation
We investigated the performance of the Chat-RGIE method on several extraction targets, including rice and parental names, characterization traits, agronomic traits, quality traits, yield traits, and resistance traits. We manually reviewed the data extracted by Chat-RGIE and calculated precision and recall based on the review results. We also randomly selected 126 pieces of data and compared the time needed for Chat-RGIE extraction plus manual review with the time needed for fully manual extraction: the Chat-RGIE workflow took less than 8 h, while fully manual extraction could not be completed within 8 h, so Chat-RGIE possesses a certain time efficiency. We evaluated the Chat-RGIE extraction results using the metrics defined in 4.3; the results are shown in Table 4, and examples of incorrect responses are shown in Table 5.
Table 4. Performance evaluation results

| Metric | Value |
|---|---|
| Precision | 0.9102 |
| Recall | 0.9941 |
| Accuracy | 0.9554 |
| Restraint Rate | 0.9265 |
| Hallucination | 0.0015 |
Table 5. Types of errors for data extraction with large language models
| Types of errors | Definitions |
|---|---|
| Major error | Cannot differentiate well between keywords such as “agronomic traits,” “yield traits,” etc. The extracted content is correct but does not belong to the required type. |
| Minor error | When a trait-related description does not exist, an attempt is made in the relevant question to explain why the data were not extracted, despite the fact that we explicitly communicated the possibility of a no-answer scenario and that only a specific character should be answered in that case. |
| Rare error | These errors occur very rarely; for example, answering that the parent of a rice variety is “Farmers” or “Rice germplasm expert,” or identifying an abbreviation used in the text (which is not an abbreviation of a rice name) as rice. |
| Hallucination | The LLM generates its own questions to answer in place of the ones we provide. |
Chat-RGIE performs well in the performance evaluation, with a Precision of 0.9102, Recall of 0.9941, and Accuracy of 0.9554. Recall exceeds Precision; analyzing the specific results, we found that when the corresponding content does not exist in the text, there are still abnormal false positive responses (detailed explanations of what could not be extracted, or responses that match the original text but do not match the question). These false positive responses lower Precision, while Recall is unaffected by this kind of problem, which causes the difference between the two. The high Accuracy also shows that Chat-RGIE is effective and can extract rice germplasm information accurately. Chat-RGIE also achieved a Restraint rate of 0.9265, a clear improvement over the 0.4533 of our previous experiment. In that experiment we used a strict format-constraint prompt containing words such as “forbidden”, which stimulated the LLMs to answer in sentences when they could not find the answer, violating our format constraints; after removing such strong words and descriptions, the LLMs were able to answer in the required format.
It is worth noting that “hallucinations” also occur when no answer can be found. Management theory suggests that when an organization or superior assigns a goal that is too difficult to achieve, employees may behave unethically under the resulting pressure. We observed a similar phenomenon when analyzing the results of Chat-RGIE. When questions were asked for which no answer existed, the LLMs in most cases tried to explain (i.e., the false positive answers and format violations mentioned above), to “make up” an answer, or even to change the question. Across all extractions we observed 2 instances in which an LLM generated its own question and answered it. Explaining when no answer can be found seems intended to prove that the lack of an answer is not the LLM’s own fault, and generating a new question and answering it seems to be a way of avoiding the lack of an answer, despite the fact that we had already emphasized the possibility of there being no answer and required a direct reply with the specified character. There seems to be a strong “pressure” on LLMs not to answer negatively, leading them to behave in unexpected ways to avoid negative answers even when a negative answer is what is required; even stating in the prompt that negative answers are allowed does not relieve this “pressure”. Currently, the frequency of “hallucinations” in Chat-RGIE is only 0.0015, and since made-up or changed questions are always re-presented as sentences, it is easy to filter out these anomalous responses from the extracted results; these errors therefore do not pose much of a problem for the performance evaluation for the time being.
These issues do not appear to be completely avoidable at this point, but the number of hallucinations could likely be further reduced if the data to be extracted were further segmented, ensuring that the majority of abstracts contain the data fields to be extracted and thereby avoiding the need for negative responses.
In addition, as mentioned above, when the desired answer cannot be extracted, especially under strong constraints, an LLM will violate even very strong formatting constraints, or fabricate facts in order to answer the question, leading to fluctuations in precision as well as “hallucination”. We believe this phenomenon is similar to human behavior under high levels of stress. In our experiments, the Restraint rate was 0.4533 when using strongly prohibitive statements and 0.9265 after removing the strongly prohibitive wording, which indicates a correlation between these strong prohibitions and the LLMs’ violation of constraints: the more prohibitive the prompt, the more the LLM is induced to break the rules. This further suggests that LLMs feel pressure and behave similarly to humans. For future use of LLMs, reducing the number of strongly prohibitive words can improve the normativity and precision of LLM responses and reduce the occurrence of hallucination. For the ongoing discussion of LLM morality and ethics, this finding at least demonstrates that LLMs are also affected by stress and behave accordingly; whether this results from learning human behaviors from the data or from generalization, it shows that LLMs have the potential to understand human feelings under stress and similarly subjective human emotions. If LLMs do possess the ability to truly understand human emotions rather than mechanically predicting the next token, then clarifying how they acquire such abilities, and whether the generation of such human-like behaviors is irreversible and uncontrollable, becomes a question worth discussing.
If it is controllable, should we allow LLMs to master human feelings, or should this development be inhibited, and how would these choices affect human society?
Results of real-life data extraction evaluation
LLMs sometimes perform well in the experimental stage but poorly in real production, so we evaluated the use of Chat-RGIE in a real production scenario. Phenotypic data for yield-related traits are of great importance in rice breeding: such data can be used to predict phenotypic values at different planting locations and help breeders evaluate the most suitable locations for a particular material. Therefore, based on an actual production scenario, we targeted the yield data in paper abstracts for extraction in order to validate the capability of the Chat-RGIE method in real life. The values corresponding to the specified traits were extracted directly from the abstracts by the Chat-RGIE method, and all values were in single-value form. Because some phenotypic values for yield-related traits also contain units, some further text processing using regular expressions is required. Although this processing may not be completely accurate, it only needs a manual review at the final stage, requiring little manual effort.
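A minimal regex post-process for unit-bearing yield values might look as follows. The pattern is illustrative, not the one used in the paper, and real data would need a wider set of unit spellings.

```python
import re

# Split a numeric yield value from its unit, e.g. "7.2 t/ha" -> (7.2, "t/ha").
# Illustrative pattern: digits with optional decimals, then an optional unit.
VALUE_UNIT = re.compile(r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z/%]+)?")

def split_value_unit(text):
    """Return (value, unit) from a yield phrase, or (None, None) if no number."""
    m = VALUE_UNIT.search(text)
    if not m:
        return None, None
    return float(m.group("value")), m.group("unit")

v, u = split_value_unit("grain yield of 7.2 t/ha")
```

Values that the regex cannot split cleanly would fall through to the final manual review stage mentioned above.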
We used Chat-RGIE to extract yield-related data from a batch of research papers and manually reviewed the extraction results, which are shown in Table 6. In this task, although the precision of Chat-RGIE is only 0.6351, the recall is 1.0, with no data judged as False Negative, meaning that no yield information was missed. Since a certain proportion of the provided abstracts do not contain yield information, when extracting from these abstracts the LLMs answered with content unrelated to the question; this content is often correct structured data from the original abstract that simply does not match the question, and we judged it as False Positive, which is why precision is lower than recall. The Accuracy of 0.8225 also validates the effectiveness of Chat-RGIE in real life. As in 4.4.1, the LLMs tried to avoid negative responses, and even when a negative response was given, some “compensation” was made. In this case, besides explaining the negative answer, the “compensation” produced a new type of answer in the real-life evaluation: providing all the data fields that could be extracted from the abstract, even though they have nothing to do with the yield we asked for. This resulted in more false positive responses than in 4.4.1 and a lower precision than in the performance evaluation, but it does not mean that Chat-RGIE performs uniformly worse in real life; in this part of the experiment Chat-RGIE exhibits a higher Recall value.
Overall, Chat-RGIE’s performance in real-life data extraction experiments is consistent with that in Performance evaluation, and both show the effectiveness of its data extraction, although there are some problems.
Table 6. Real-life data extraction evaluation results

| Metric | Value |
|---|---|
| Precision | 0.6351 |
| Recall | 1.0 |
| Accuracy | 0.8225 |
| Restraint Rate | 0.911 |
| Hallucination | 0.005 |
Obviously, the likelihood of encountering null values during extraction is higher in real production scenarios than in controlled environments, because the generation problem of LLMs, i.e., producing interpretations when encountering unanswerable questions, cannot yet be completely avoided, and this leads to fluctuations in precision. Therefore, when using Chat-RGIE for data extraction in practical scenarios, it is recommended to use high-quality data as much as possible to reduce the chance of null-value answers and thereby preserve the precision of data extraction.
Conclusion
In this paper, we demonstrate that by constructing suitable prompts with certain attached constraints (the Chat-RGIE method presented here), high-quality rice germplasm data can be extracted from the abstracts of research papers without additional work such as model fine-tuning. We propose and demonstrate the effectiveness of a set of well-designed prompts for extracting key data and a set of prompts for verifying the accuracy of the responses. The Chat-RGIE method obtains a precision of 0.9102, a recall of 0.9941, and an accuracy of 0.9554 on our test set of rice germplasm data, with a low frequency of “hallucinations”, suggesting that it has a high-quality data extraction capability. Further, we tested the ability of Chat-RGIE to extract yield information as a first step toward developing a rice yield database with the method; Chat-RGIE obtained a precision of 0.6351, a recall of 1.0, and an accuracy of 0.8225 in the yield extraction task. The precision was lower than that on the test set because, when no answer could be extracted, kinds of responses appeared that did not occur on the test set, resulting in more False Positive responses; the recall of 1.0 indicates that no data were judged False Negative.
In this study, we also found that LLMs show a clear and strong “avoidance” of giving negative responses when analyzing data that require them. These behaviors are very similar to human behavior in the face of stress, including providing additional explanations while giving a negative answer and creating their own questions to answer. The relative frequency of such behaviors changed when we used prompts with different levels of emphasis on “forbidden” wording, so we identified a correlation between the level of pressure exerted on the LLMs in the prompt and the degree to which they violated the constraints: the higher the pressure, the more likely the LLMs were to violate the constraints. This is consistent with the likelihood that humans will engage in unethical behavior in the face of stress.
Based on our findings and observations of the experimental process, we suggest that when using prompts to ask LLMs to complete a task, strong prohibitions such as “strictly prohibit” should be avoided as much as possible, in order to ensure the normativity of the responses, avoid the creation of hallucinations, and reduce the possibility that LLMs produce unwanted behaviors under pressure. In addition, it is clear that LLMs respond to stress in a manner similar to human behavior under stress. It is not yet clear whether this is a behavioral pattern learned from the data during pre-training or a generalization of the models’ own abilities, but it suggests that LLMs at least possess the potential to truly understand human emotions. We believe that a deeper understanding of the principles of LLMs, clarifying the mechanism behind this kind of human-like behavior, and determining whether it is an inevitable and controllable trend are of great significance for the further development of LLMs. If it is controllable, whether we should let LLMs master human feelings will involve moral and ethical discussions related to AI. For the time being, however, if reducing stress in prompts improves LLM precision and response normativity, then a number of other techniques applicable to human behavior hold promise for further improving LLM performance when applied in prompts. Although Chat-RGIE has made some progress, accuracy remains a challenge under non-ideal conditions. Future research will explore strategies to adapt to diverse real-life data in order to address scalability and data quality issues in practical scenarios.
In summary, Chat-RGIE ensures data extraction of a certain quality and provides a simple method of implementation. This suggests that data extraction using LLMs has the potential to replace, to some extent, previous more labor-intensive methods. Since Chat-RGIE does not change to a large extent depending on the LLM used, it is also simple to apply to future LLMs.
Author contributions
Y.W. studied the related literature, implemented the proposed framework, conducted the experiments, and prepared the manuscript. J.F. designed the proposed framework, reviewed, and prepared the manuscript. All authors read and approved the final manuscript.
Funding
This research was supported by the Science and technology innovation project of Chinese Academy of Agricultural Sciences (No. CAAS-ASTIP-2025-AII).
Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
Code availability
The code is available at https://github.com/mangx/Chat-RGIE.
Declarations
Competing interests
The authors declare no competing interests.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Ray, DK; Ramankutty, N; Mueller, ND; West, PC; Foley, JA. Recent patterns of crop yield growth and stagnation. Nat Commun; 2012; 3.
2. Ray, DK; Mueller, ND; West, PC; Foley, JA. Yield trends are insufficient to double global crop production by 2050. PLoS ONE; 2013; 8.
3. Szachniuk, M. Rnapolis: computational platform for rna structure analysis. Found Comput Decis Sci; 2019; 44.
4. Kurata, N; Yamazaki, YO. An integrated biological and genome information database for rice. Plant Physiol; 2006; 140.
5. Swain, MC; Cole, JM. Chemdataextractor: a toolkit for automated extraction of chemical information from the scientific literature. J Chem Inf Model; 2016; 56.
6. Mavracic, J; Court, CJ; Isazawa, T; Elliott, SR; Cole, JM. Chemdataextractor 2.0: autopopulated ontologies for materials science. J Chem Inf Model; 2021; 61.
7. Court, CJ; Cole, JM. Magnetic and superconducting phase diagrams and transition temperatures predicted using text mining and machine learning. NPJ Comput Mater; 2020; 6.
8. Kumar, P; Kabra, S; Cole, JM. Auto-generating databases of yield strength and grain size using chemdataextractor. Sci Data; 2022; 9.
9. Sierepeklis, O; Cole, JM. A thermoelectric materials database auto-generated from the scientific literature using chemdataextractor. Sci Data; 2022; 9.
10. Zhao, J; Cole, JM. Reconstructing chromatic-dispersion relations and predicting refractive indices using text mining and machine learning. J Chem Inf Model; 2022; 62.
11. Zhao, J; Cole, JM. A database of refractive indices and dielectric constants auto-generated using chemdataextractor. Sci Data; 2022; 9.
12. Beard, EJ; Cole, JM. Perovskite- and dye-sensitized solar-cell device databases auto-generated using chemdataextractor. Sci Data; 2022; 9.
13. Dong, Q; Cole, JM. Auto-generated database of semiconductor band gaps using chemdataextractor. Sci Data; 2022; 9.
14. Beard, EJ; Sivaraman, G; Vázquez-Mayagoitia, Á; Vishwanath, V; Cole, JM. Comparative dataset of experimental and computational attributes of uv/vis absorption spectra. Sci Data; 2019; 6.
15. Wang, Z; Kononova, O; Cruse, K; He, T; Huo, H; Fei, Y; Zeng, Y; Sun, Y; Cai, Z; Sun, W et al. Dataset of solution-based inorganic materials synthesis procedures extracted from the scientific literature. Sci Data; 2022; 9.
16. Huo, H; Bartel, CJ; He, T; Trewartha, A; Dunn, A; Ouyang, B; Jain, A; Ceder, G. Machine-learning rationalization and prediction of solid-state synthesis conditions. Chem Mater; 2022; 34.
17. Saal, JE; Oliynyk, AO; Meredig, B. Machine learning in materials discovery: confirmed predictions and their underlying approaches. Annu Rev Mater Res; 2020; 50, pp. 49-69.
18. Morgan, D; Jacobs, R. Opportunities and challenges for machine learning in materials science. Annu Rev Mater Res; 2020; 50.
19. Karpovich C, Jensen Z, Venugopal V, Olivetti E. Inorganic synthesis reaction condition prediction with generative machine learning. arXiv preprint. 2021. arXiv:2112.09612
20. Georgescu, AB; Ren, P; Toland, AR; Zhang, S; Miller, KD; Apley, DW; Olivetti, EA; Wagner, N; Rondinelli, JM. Database, features, and machine learning model to identify thermally driven metal-insulator transition compounds. Chem Mater; 2021; 33.
21. Kononova, O; He, T; Huo, H; Trewartha, A; Olivetti, EA; Ceder, G. Opportunities and challenges of text mining in materials research. Iscience; 2021; 24.
22. Kim, E; Jensen, Z; Grootel, A; Huang, K; Staib, M; Mysore, S; Chang, H-S; Strubell, E; McCallum, A; Jegelka, S et al. Inorganic materials synthesis planning with literature-trained neural networks. J Chem Inf Model; 2020; 60.
23. Kim, E; Huang, K; Saunders, A; McCallum, A; Ceder, G; Olivetti, E. Materials synthesis insights from scientific literature via text extraction and machine learning. Chem Mater; 2017; 29.
24. Jensen, Z; Kim, E; Kwon, S; Gani, TZ; Román-Leshkov, Y; Moliner, M; Corma, A; Olivetti, E. A machine learning approach to zeolite synthesis enabled by automatic literature data extraction. ACS Cent Sci; 2019; 5.
25. Gilligan, LP; Cobelli, M; Taufour, V; Sanvito, S. A rule-free workflow for the automated generation of databases from scientific literature. NPJ Comput Mater; 2023; 9.
26. Polak, MP; Morgan, D. Extracting accurate materials data from research papers with conversational language models and prompt engineering. Nat Commun; 2024; 15.
27. Brown, T; Mann, B; Ryder, N; Subbiah, M; Kaplan, JD; Dhariwal, P; Neelakantan, A; Shyam, P; Sastry, G; Askell, A et al. Language models are few-shot learners. Adv Neural Inf Process Syst; 2020; 33, pp. 1877-1901.
28. Ouyang, L; Wu, J; Jiang, X; Almeida, D; Wainwright, C; Mishkin, P; Zhang, C; Agarwal, S; Slama, K; Ray, A et al. Training language models to follow instructions with human feedback. Adv Neural Inf Process Syst; 2022; 35, pp. 27730-27744.
29. Workshop B, Scao TL, Fan A, Akiki C, Pavlick E, Ilić S, Hesslow D, Castagné R, Luccioni AS, Yvon F, et al. Bloom: A 176b-parameter open-access multilingual language model. Preprint at. 2022. arXiv:2211.05100
30. Zhang S, Roller S, Goyal N, Artetxe M, Chen M, Chen S, Dewan C, Diab M, Li X, Lin XV, et al. Opt: Open pre-trained transformer language models. Preprint at. 2022. arXiv:2205.01068
31. Touvron H, Lavril T, Izacard G, Martinet X, Lachaux M-A, Lacroix T, Rozière B, Goyal N, Hambro E, Azhar F, et al. Llama: Open and efficient foundation language models. Preprint at. 2023. arXiv:2302.13971
32. Dunn A, Dagdelen J, Walker N, Lee S, Rosen AS, Ceder G, Persson K, Jain A. Structured information extraction from complex scientific text with fine-tuned large language models. Preprint at. 2022. arXiv:2212.05238
33. Polak, MP; Modi, S; Latosinska, A; Zhang, J; Wang, C-W; Wang, S; Hazra, AD; Morgan, D. Flexible, model-agnostic method for materials data extraction from text using general purpose language models. Digit Discov; 2024; 3.
34. Qin C, Zhang A, Zhang Z, Chen J, Yasunaga M, Yang D. Is chatgpt a general-purpose natural language processing task solver? arXiv preprint. 2023. arXiv:2302.06476
35. Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, Almeida D, Altenschmidt J, Altman S, Anadkat S, et al. Gpt-4 technical report. Preprint at. 2023. arXiv:2303.08774
36. Han, X; Zhang, Z; Ding, N; Gu, Y; Liu, X; Huo, Y; Qiu, J; Yao, Y; Zhang, A; Zhang, L et al. Pre-trained models: past, present and future. AI Open; 2021; 2, pp. 225-250.
37. Wei J, Tay Y, Bommasani R, Raffel C, Zoph B, Borgeaud S, Yogatama D, Bosma M, Zhou D, Metzler D, et al. Emergent abilities of large language models. arXiv preprint. 2022. arXiv:2206.07682
38. Li, J; Tang, T; Zhao, WX; Nie, J-Y; Wen, J-R. Pre-trained language models for text generation: a survey. ACM Comput Surv; 2024; 56.
39. Hoffmann J, Borgeaud S, Mensch A, Buchatskaya E, Cai T, Rutherford E, Casas DdL, Hendricks LA, Welbl J, Clark A, et al. Training compute-optimal large language models. arXiv preprint. 2022. arXiv:2203.15556
40. Bian J, Zheng J, Zhang Y, Zhu S. Inspire the large language model by external knowledge on biomedical named entity recognition. arXiv preprint. 2023. arXiv:2309.12278
41. Polak, MP; Morgan, D. Extracting accurate materials data from research papers with conversational language models and prompt engineering. Nat Commun; 2024; 15.
42. Gilardi, F; Alizadeh, M; Kubli, M. Chatgpt outperforms crowd workers for text-annotation tasks. Proc Natl Acad Sci; 2023; 120.
43. Huang J, Gu SS, Hou L, Wu Y, Wang X, Yu H, Han J. Large language models can self-improve. arXiv preprint. 2022. arXiv:2210.11610
44. Bubeck S, Chandrasekaran V, Eldan R, Gehrke J, Horvitz E, Kamar E, Lee P, Lee YT, Li Y, Lundberg S, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint. 2023. arXiv:2303.12712
45. Fu Y, Peng H, Khot T. How does gpt obtain its ability? tracing emergent abilities of language models to their sources. Yao Fu’s Notion. 2022.
46. Ye J, Chen X, Xu N, Zu C, Shao Z, Liu S, Cui Y, Zhou Z, Gong C, Shen Y, et al. A comprehensive capability analysis of gpt-3 and gpt-3.5 series models. arXiv preprint. 2023. arXiv:2303.10420
47. McCloskey, M; Cohen, NJ. Catastrophic interference in connectionist networks: the sequential learning problem. Psychology of learning and motivation; 1989; Amsterdam: Elsevier; pp. 109-165.
48. Kemker R, McClure M, Abitino A, Hayes T, Kanan C. Measuring catastrophic forgetting in neural networks. In: Proceedings of the AAAI Conference on Artificial Intelligence, 2018; vol. 32
49. Ouyang, L; Wu, J; Jiang, X; Almeida, D; Wainwright, C; Mishkin, P; Zhang, C; Agarwal, S; Slama, K; Ray, A et al. Training language models to follow instructions with human feedback. Adv Neural Inf Process Syst; 2022; 35, pp. 27730-27744.
50. Bang Y, Cahyawijaya S, Lee N, Dai W, Su D, Wilie B, Lovenia H, Ji Z, Yu T, Chung W, et al. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint. 2023. arXiv:2302.04023
51. Lu P, Qiu L, Yu W, Welleck S, Chang K-W. A survey of deep learning for mathematical reasoning. arXiv preprint. 2022. arXiv:2212.10535
52. Qian J, Wang H, Li Z, Li S, Yan X. Limitations of language models in arithmetic and symbolic induction. arXiv preprint. 2022. arXiv:2208.05051
53. Wei, J; Wang, X; Schuurmans, D; Bosma, M; Xia, F; Chi, E; Le, QV; Zhou, D et al. Chain-of-thought prompting elicits reasoning in large language models. Adv Neural Inf Process Syst; 2022; 35, pp. 24824-24837.
54. Lyu Q, Havaldar S, Stein A, Zhang L, Rao D, Wong E, Apidianaki M, Callison-Burch C. Faithful chain-of-thought reasoning. arXiv preprint. 2023. arXiv:2301.13379
55. Patel A, Bhattamishra S, Goyal N. Are nlp models really able to solve simple math word problems? arXiv preprint. 2021. arXiv:2103.07191
56. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, Scales N, Tanwani A, Cole-Lewis H, Pfohl S, et al. Large language models encode clinical knowledge. arXiv preprint. 2022. arXiv:2212.13138
57. Lin, Z. How to write effective prompts for large language models. Nat Hum Behav; 2024; 8.
58. Liu, P; Yuan, W; Fu, J; Jiang, Z; Hayashi, H; Neubig, G. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput Surv; 2023; 55.
59. Yao, S; Yu, D; Zhao, J; Shafran, I; Griffiths, T; Cao, Y; Narasimhan, K. Tree of thoughts: deliberate problem solving with large language models. Adv Neural Inf Process Syst; 2024; 36, pp. 11809-22.
60. Besta M, Blach N, Kubicek A, Gerstenberger R, Podstawski M, Gianinazzi L, Gajda J, Lehmann T, Niewiadomski H, Nyczyk P, et al. Graph of thoughts: Solving elaborate problems with large language models. In: Proceedings of the AAAI Conference on Artificial Intelligence, 2024; vol. 38, pp. 17682–17690
61. Yao S, Zhao J, Yu D, Du N, Shafran I, Narasimhan K, Cao Y. React: Synergizing reasoning and acting in language models. arXiv preprint. 2022. arXiv:2210.03629
62. Wang X, Wei J, Schuurmans D, Le Q, Chi E, Narang S, Chowdhery A, Zhou D. Self-consistency improves chain of thought reasoning in language models. arXiv preprint. 2022. arXiv:2203.11171
63. Brown TB. Language models are few-shot learners. arXiv preprint. 2020. arXiv:2005.14165
64. Humphreys K, Gaizauskas R, Azzam S, Huyck C, Mitchell B, Cunningham H, Wilks Y. University of Sheffield: description of the LaSIE-II system as used for MUC-7. In: Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, 1998.
65. Krupka G, Hausman K. IsoQuest Inc.: description of the NetOwl extractor system as used for MUC-7. In: Proc. 7th Message Understanding Conf, 1998; pp. 21–28
66. Rabiner, L; Juang, B. An introduction to hidden markov models. IEEE ASSP Mag; 1986; 3.
67. Kapur, JN. Maximum-entropy models in science and engineering; 1989; Hoboken: John Wiley & Sons.
68. Burges, CJ. A tutorial on support vector machines for pattern recognition. Data Min Knowl Disc; 1998; 2.
69. Lafferty J, McCallum A, Pereira F, et al. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Icml. Williamstown, MA, 2001; vol. 1, p. 3.
70. Sampaio, PS; Almeida, AS; Brites, CM. Use of artificial neural network model for rice quality prediction based on grain physical parameters. Foods; 2021; 10.
71. Tiozon, RJN; Sreenivasulu, N; Alseekh, S; Sartagoda, KJD; Usadel, B; Fernie, AR. Metabolomics and machine learning technique revealed that germination enhances the multi-nutritional properties of pigmented rice. Commun Biol; 2023; 6.
72. Baevski A, Edunov S, Liu Y, Zettlemoyer L, Auli M. Cloze-driven pretraining of self-attention networks. arXiv preprint. 2019. arXiv:1903.07785
73. Collobert, R; Weston, J; Bottou, L; Karlen, M; Kavukcuoglu, K; Kuksa, P. Natural language processing (almost) from scratch. J Mach Learn Res; 2011; 12, pp. 2493-2537.
74. Devlin J. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint. 2018. arXiv:1810.04805
75. Bharadwaj A, Mortensen DR, Dyer C, Carbonell JG. Phonologically aware neural model for named entity recognition in low resource transfer settings. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016; pp. 1462–1472
76. Rei M, Crichton GK, Pyysalo S. Attending to characters in neural sequence labeling models. arXiv preprint. 2016. arXiv:1611.04361
77. Kumar, R; Khatri, A; Acharya, V. Deep learning uncovers distinct behavior of rice network to pathogens response. Iscience; 2022; 25.
78. Vourlaki, I-T; Ramos-Onsins, SE; Pérez-Enciso, M; Castanera, R. Evaluation of deep learning for predicting rice traits using structural and single-nucleotide genomic variants. Plant Methods; 2024; 20.
79. Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, Almeida D, Altenschmidt J, Altman S, Anadkat S, et al. Gpt-4 technical report. Preprint at. 2023. arXiv:2303.08774
80. Radford, A; Wu, J; Child, R; Luan, D; Amodei, D; Sutskever, I et al. Language models are unsupervised multitask learners. OpenAI Blog; 2019; 1.
81. Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, Bashlykov N, Batra S, Bhargava P, Bhosale S, et al. Llama 2: Open foundation and fine-tuned chat models. Preprint at. 2023. arXiv:2307.09288
82. Min, B; Ross, H; Sulem, E; Veyseh, APB; Nguyen, TH; Sainz, O; Agirre, E; Heintz, I; Roth, D. Recent advances in natural language processing via large pre-trained language models: a survey. ACM Comput Surv; 2023; 56.
83. Yoo KM, Park D, Kang J, Lee S-W, Park W. Gpt3mix: Leveraging large-scale language models for text augmentation. Preprint at. 2021. arXiv:2104.08826
84. Albrecht J, Kitanidis E, Fetterman AJ. Despite "super-human" performance, current LLMs are unsuited for decisions about ethics and safety. Preprint at. 2022. arXiv:2212.06295
85. Liang PP, Wu C, Morency L-P, Salakhutdinov R. Towards understanding and mitigating social biases in language models. In: International Conference on Machine Learning, PMLR. 2021; pp. 6565–6576
86. Alvi M, Zisserman A, Nellåker C. Turning a blind eye: Explicit removal of biases and variation from deep neural network embeddings. In: Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018; pp. 0–0.
87. Zhang J, Verma V. Discover discriminatory bias in high accuracy models embedded in machine learning algorithms. In: The International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, Springer, 2020; pp. 1537–1545.
88. Hajikhani A, Cole C. A critical review of large language models: sensitivity, bias, and the path toward specialized ai. Quant Sci Stud; 2024; pp. 1–22.
89. Gartlehner G, Kahwati L, Hilscher R, Thomas I, Kugley S, Crotty K, Viswanathan M, Nussbaumer-Streit B, Booth G, Erskine N, et al. Data extraction for evidence synthesis using a large language model: a proof-of-concept study. Res Synth Methods; 2024.
90. Beltagy I, Lo K, Cohan A. SciBERT: A pretrained language model for scientific text. Preprint at. 2019. arXiv:1903.10676
91. Lewis P, Ott M, Du J, Stoyanov V. Pretrained language models for biomedical and clinical tasks: understanding and extending the state-of-the-art. In: Proceedings of the 3rd Clinical Natural Language Processing Workshop, 2020; pp. 146–157
92. Wei J, Bosma M, Zhao VY, Guu K, Yu AW, Lester B, Du N, Dai AM, Le QV. Finetuned language models are zero-shot learners. arXiv preprint. 2021. arXiv:2109.01652
93. Sanh V, Webson A, Raffel C, Bach S, Sutawika L, Alyafeai Z, Chaffin A, Stiegler A, Raja A, Dey M, et al. Multitask prompted training enables zero-shot task generalization. In: International Conference on Learning Representations. 2022
94. Chung, HW; Hou, L; Longpre, S; Zoph, B; Tay, Y; Fedus, W; Li, Y; Wang, X; Dehghani, M; Brahma, S et al. Scaling instruction-finetuned language models. J Mach Learn Res; 2024; 25.
95. Kenton Z, Everitt T, Weidinger L, Gabriel I, Mikulik V, Irving G. Alignment of language agents. arXiv preprint. 2021. arXiv:2103.14659
96. Askell A, Bai Y, Chen A, Drain D, Ganguli D, Henighan T, Jones A, Joseph N, Mann B, DasSarma N, et al. A general language assistant as a laboratory for alignment. arXiv preprint. 2021. arXiv:2112.00861
97. Bai Y, Jones A, Ndousse K, Askell A, Chen A, DasSarma N, Drain D, Fort S, Ganguli D, Henighan T, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint. 2022. arXiv:2204.05862
98. Hu EJ, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S, Wang L, Chen W. Lora: Low-rank adaptation of large language models. arXiv preprint. 2021. arXiv:2106.09685
99. Hu Z, Wang L, Lan Y, Xu W, Lim E-P, Bing L, Xu X, Poria S, Lee RK-W. Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. arXiv preprint. 2023. arXiv:2304.01933
100. Houlsby N, Giurgiu A, Jastrzebski S, Morrone B, De Laroussilhe Q, Gesmundo A, Attariyan M, Gelly S. Parameter-efficient transfer learning for nlp. In: International Conference on Machine Learning, PMLR, 2019; pp. 2790–2799
101. He J, Zhou C, Ma X, Berg-Kirkpatrick T, Neubig G. Towards a unified view of parameter-efficient transfer learning. arXiv preprint. 2021. arXiv:2110.04366
102. Pfeiffer J, Vulić I, Gurevych I, Ruder S. Mad-x: An adapter-based framework for multi-task cross-lingual transfer. arXiv preprint. 2020. arXiv:2005.00052
103. Lin, Z. How to write effective prompts for large language models. Nat Hum Behav; 2024; 8.
104. Zhao, B; Jin, W; Del Ser, J; Yang, G. Chatagri: Exploring potentials of ChatGPT on cross-linguistic agricultural text classification. Neurocomputing; 2023; 557, 126708.
105. Peng R, Liu K, Yang P, Yuan Z, Li S. Embedding-based retrieval with llm for effective agriculture information extracting from unstructured data. arXiv preprint. 2023. arXiv:2308.03107
106. Qing, J; Deng, X; Lan, Y; Li, Z. Gpt-aided diagnosis on agricultural image based on a new light yolopc. Comput Electron Agric; 2023; 213, 108168.
107. Hendrycks D, Burns C, Basart S, Zou A, Mazeika M, Song D, Steinhardt J. Measuring massive multitask language understanding. Preprint at. 2020. arXiv:2009.03300
108. Zellers R, Holtzman A, Bisk Y, Farhadi A, Choi Y. Hellaswag: can a machine really finish your sentence? Preprint at. 2019. arXiv:1905.07830
109. Suzgun M, Scales N, Schärli N, Gehrmann S, Tay Y, Chung HW, Chowdhery A, Le QV, Chi EH, Zhou D, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. Preprint at. 2022. arXiv:2210.09261
110. Cobbe K, Kosaraju V, Bavarian M, Chen M, Jun H, Kaiser L, Plappert M, Tworek J, Hilton J, Nakano R, et al. Training verifiers to solve math word problems. Preprint at. 2021. arXiv:2110.14168
111. Hendrycks D, Burns C, Kadavath S, Arora A, Basart S, Tang E, Song D, Steinhardt J. Measuring mathematical problem solving with the math dataset. Preprint at. 2021. arXiv:2103.03874
112. Bai J, Bai S, Chu Y, Cui Z, Dang K, Deng X, Fan Y, Ge W, Han Y, Huang F, et al. Qwen technical report. arXiv preprint. 2023. arXiv:2309.16609
113. Young A, Chen B, Li C, Huang C, Zhang G, Zhang G, Li H, Zhu J, Chen J, Chang J, et al. Yi: open foundation models by 01.AI. arXiv preprint. 2024. arXiv:2403.04652
114. Dubey A, Jauhri A, Pandey A, Kadian A, Al-Dahle A, Letman A, Mathur A, Schelten A, Yang A, Fan A, et al. The llama 3 herd of models. arXiv preprint. 2024. arXiv:2407.21783
115. Liu A, Feng B, Wang B, Wang B, Liu B, Zhao C, Dengr C, Ruan C, Dai D, Guo D, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint. 2024. arXiv:2405.04434
116. Sanmartin D. KG-RAG: Bridging the Gap Between Knowledge and Creativity. Preprint at. 2024. arXiv:2405.12035
117. Powers DM. Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation. arXiv preprint. 2020. arXiv:2010.16061
© The Author(s) 2025. This work is published under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License (http://creativecommons.org/licenses/by-nc-nd/4.0/).