1. Introduction
Traditional approaches to acquiring pest and disease management information have several limitations in practice. Such knowledge relies on the expertise of agricultural experts or static information from textbooks, which is often difficult to adapt to local environmental conditions and plant protection needs [1]. The same control methods may be effective in one region but pose serious risks to ecosystems or human health in another, particularly in areas near water sources or residential communities. This lack of adaptability further exacerbates the inefficiency of traditional pest management. Additionally, farmers have limited access to key information, often relying on direct consultations with agricultural experts. This mode of dissemination is inefficient, has limited coverage, and cannot provide timely responses during large-scale pest outbreaks [2]. For example, in 1984, Shaanxi Province encountered the American white moth (fall webworm) for the first time; owing to delayed information and reliance on outdated practices, local farmers were unable to adopt effective control measures in time, resulting in significant agricultural losses.
In recent years, the application of deep learning technologies to pest and disease management has garnered significant attention from researchers. Deep learning methods, particularly those based on Transformer models and knowledge graphs, have proven effective in addressing the challenges of pest and disease detection. For example, edge computing based on Transformers and knowledge graphs has been used for intelligent cotton pest and disease detection [3]; neural architecture search combined with knowledge graphs has been applied to chili disease detection [4]; and crop pest and disease named entity recognition has employed gated fusion units and Manhattan attention methods [5]. Pest and disease management systems based on these technologies can cover a wide range of agricultural information.
However, despite these promising developments, several challenges remain. One major limitation is that the creation and expansion of large-scale structured data heavily depend on entity relationship annotation, a process that is both time-consuming and resource-intensive. Annotating the relationships between various entities—such as pests, diseases, crops, and environmental conditions—requires significant expertise and human labor, which limits the scalability of these systems, especially in regions with limited access to domain experts. Moreover, traditional knowledge graphs have inherent limitations in reasoning and analysis, making it difficult to dynamically adjust and provide personalized advice based on current environmental conditions and crop growth status [6]. Although knowledge graphs excel at modeling structured data, they struggle to integrate embedded textual information effectively. This limitation prevents them from performing more advanced tasks, such as knowledge graph completion or dynamic question answering, which require a deeper understanding of context beyond basic entity relationships. Furthermore, the fixed nature of entity relationships in traditional knowledge graphs makes it difficult for such systems to adapt to new or changing conditions. In pest and disease management, understanding context—such as environmental factors, crop health, and seasonal changes—is essential for providing accurate and personalized advice.
The inability to process unstructured data, combined with rigid entity relationships, reduces the system’s adaptability and its ability to offer timely, relevant recommendations.
Large language models (LLMs), such as GPT [7], Llama [8], and BLOOM [9], have excelled across a wide range of natural language processing (NLP) tasks, particularly language understanding and text generation, showcasing remarkable abilities [10]. They provide more accurate reasoning results in tasks such as machine translation [11], grammar correction [12], and summarization [13]. Through dynamic adjustment and self-optimization, LLMs can adapt more effectively to different application scenarios, overcoming the inherent limitations of traditional methods in real-time adaptability and reasoning. Although LLMs can perform a wide range of general language tasks, their training datasets do not fully cover the key knowledge and terminology of specialized fields [14]. As a result, they are unable to handle certain specialized domain problems [15,16,17,18]. In particular, when dealing with agricultural pest- and disease-related issues, LLMs often fail to fully understand the professional terms and concepts of the field, leading to ambiguities and misunderstandings [19]. The diversity of Chinese expressions exacerbates this issue: polysemy and frequent field-specific abbreviations are common in Chinese. These challenges highlight the limitations of LLMs when handling professional tasks in the Chinese language.
To address these issues, this paper focuses on the field of agricultural pest and disease management and proposes and develops a Chinese LLM specifically designed for pest and disease knowledge: IPM-AgriGPT (Integrated Pest Management—Agricultural Generative Pre-Trained Transformer, Version 1.0). The main contributions of IPM-AgriGPT include the following:
Proposing a Generation-Evaluation Adversarial (G-EA) framework to improve data processing efficiency and ensure the quality of the question-answering corpus.
Applying the Agricultural Contextual Reasoning Chain-of-Thought Distillation (ACR-CoTD), which transfers the reasoning process from the teacher model to the student model, enhancing the student model’s reasoning ability and overall performance in complex tasks.
Using low-rank adaptation (LoRA) for supervised fine-tuning to improve the base model’s understanding of agricultural pest and disease knowledge.
Constructing an evaluation benchmark for the agricultural pest and disease domain to comprehensively assess the capabilities of LLMs in this field.
2. Related Work
LLMs are deep learning-based NLP models designed to understand and generate natural language text. The emergence of open-source LLMs has greatly stimulated researchers' enthusiasm for working with them. To address the challenges of training with limited resources, researchers have proposed parameter-efficient fine-tuning techniques, such as adapter tuning [20], LoRA [21], AdaLoRA [22], and QLoRA [23]. These methods can enhance a model's performance on specific tasks without retraining the entire model.
Nowadays, LLMs have been widely applied across various specialized fields. For instance, Zhang et al. developed BB-GeoGPT [24], which significantly improves the accuracy of geographic information tasks by constructing a pre-training dataset specifically for geographic information science (GIS) and conducting supervised fine-tuning (SFT). Wang et al. proposed Huatuo [25], which fine-tunes LLaMA by incorporating Chinese medical knowledge. This integration aims to improve the accuracy and professionalism of Chinese medical tasks, addressing the current deficiencies of LLMs in the Chinese medical domain. Yang et al. introduced BM25-LLM, which combines BM25 and an LLM to significantly enhance retrieval accuracy in biomedical question-answering systems using the BioASQ and TREC-COVID datasets [26]. Huang et al. developed Lawyer LLaMA, which combines legal expertise with an information retrieval module to provide professional legal consultation answers [27]. Additionally, Alghamdi et al. used GPT-3.5 Turbo as the base model and combined domain-specific fine-tuning with retrieval-augmented generation (RAG) techniques to optimize the performance of a healthcare conversational agent, ensuring it provides accurate and reliable medical information [28].
LLMs have also demonstrated significant potential in agriculture. These models can provide data support and decision-making recommendations at various stages of agricultural production, such as offering personalized crop management advice to farmers through digital assistants [29,30,31]. Some researchers have initiated preliminary investigations into the application of LLMs in agriculture. For example, Zhao et al. proposed a plant disease detection system that integrates LLMs with an agricultural knowledge graph. This approach combines graph neural networks, representation learning, and symbolic reasoning techniques to enhance both the accuracy and efficiency of disease detection [2]. Zhao et al. proposed the ChatAgri system, which showcases the cross-lingual transfer capabilities of LLMs in the context of agricultural text classification [32]. Yang et al. proposed the ShizishanGPT system, which showcases intelligent agricultural question-answering capabilities grounded in the RAG framework and an agent architecture [33].
Currently, the use of external APIs of LLMs in agricultural research has become quite mature. In contrast, research on fine-tuning LLMs with specialized knowledge in the agricultural pest and disease domain remains relatively scarce. Therefore, this paper proposes the development of a domain-specific LLM for pest and disease management through fine-tuning—IPM-AgriGPT. This initiative aims to further enhance the application of LLMs in agriculture, providing smarter and more precise support for agricultural production.
3. Materials and Methods
3.1. Dataset Collection
This section presents the data sources utilized in this study. The dataset employed in this paper comprises three components: web data, database data, and textual data. In the data cleaning process, this study utilized a combination of regular expressions and manual review to eliminate duplicates and remove redundant information from the raw data collected through web scraping. This cleaned and structured dataset serves both as the corpus for model pre-training and as a high-quality foundation for the subsequent development of the question–answer corpus. The dataset format of the baseline corpus is illustrated in Figure 1.
The final dataset contains a total of 5704 entries. The fundamental statistics regarding the corpus collection are illustrated in Table 1.
3.1.1. Web
The web data primarily originate from open-source pest and disease control websites on the Internet that allow usage. For pest-related pages, the scraped content included various aspects such as pest names, distribution and damage, morphological characteristics, occurrence patterns, and control methods. On the other hand, for disease-related pages, the collected information comprised the disease name, distribution and damage, symptoms, pathogen involved, occurrence patterns, and control strategies. To systematically collect these data, the Scrapy framework was employed by iterating through target paths according to predefined rules. Scrapy is an open-source web scraping framework based on Python, recognized for its efficiency, flexibility, and scalability. Using Scrapy enabled rapid development of the web scraping program while allowing customization of scraping rules, parsing logic, and data storage methods to ensure effective data collection from the target website. In this study, a dedicated crawler was developed using the Scrapy framework to comprehensively gather pest and disease information. As a result, 2498 pieces of text data related to pests and diseases were successfully scraped from the target website. The specific types and quantities of the collected data are illustrated in Figure 2.
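For illustration, the following is a minimal sketch of a Scrapy spider in the spirit of the crawler described above; the start URL and CSS selectors are hypothetical placeholders, not the actual site or scraping rules used in this study.

```python
# Illustrative sketch only: the index URL and selectors below are hypothetical
# placeholders standing in for the target pest/disease website.
import scrapy


class PestSpider(scrapy.Spider):
    """Crawls pest/disease pages and yields one structured record per page."""
    name = "pest_spider"
    start_urls = ["https://example.org/pests/index.html"]  # hypothetical

    def parse(self, response):
        # Follow every detail-page link listed on the index page.
        for href in response.css("a.detail::attr(href)").getall():
            yield response.follow(href, callback=self.parse_detail)

    def parse_detail(self, response):
        # Field names mirror the attributes collected in this study;
        # the selectors are assumptions about the page layout.
        yield {
            "name": response.css("h1::text").get(),
            "distribution_damage": response.css("#damage ::text").getall(),
            "morphology": response.css("#morphology ::text").getall(),
            "occurrence_pattern": response.css("#occurrence ::text").getall(),
            "control_methods": response.css("#control ::text").getall(),
        }
```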
3.1.2. Database
The data for this database are sourced from the Agricultural Pest and Disease Image and Text Database, which encompasses 31 distinct types of pests, diseases, and weeds belonging to various families and genera. Given that certain pests, diseases, and weeds are found across multiple crops, the dataset underwent a process of deduplication and cleaning, culminating in a total of 2833 records. The specific categories and quantities are depicted in Figure 3.
3.1.3. Text
This study has also compiled relevant books and academic papers, which provide a solid theoretical foundation and practical insights for effective pest and disease management. The collected textual data includes research articles from scholarly journals and literature related to agricultural pests and diseases, as well as pertinent reports. These documents address various aspects, including the mechanisms underlying the emergence of pests and diseases, their ecological impacts, control methodologies, and management strategies.
3.2. Dataset Preprocessing
This section outlines the data processing procedure. Given that the corpus for fine-tuning must be formatted in a question–answer structure, manually generating a substantial amount of question–answer data would necessitate considerable human effort. Consequently, this study employed the G-EA framework to produce high-quality question–answer corpora. A total of 27,160 entries were generated for the corpus, and the format of the question–answer corpus is depicted in Figure 4.
3.2.1. Corpus Construction
This study employed the API of LLMs to generate question–answer datasets through unsupervised text generation instructions [34]. This approach facilitates the creation of question–answer datasets pertinent to the pest and disease domain utilizing the aforementioned data. To ensure that the generated question–answer corpus is suitable for fine-tuning, specific instructions and templates were provided to the LLMs, as illustrated in Figure 5 and Figure 6.
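As a concrete illustration of this generation step, the sketch below calls an OpenAI-style chat-completion API with a simplified instruction template; the template wording and model name are placeholder assumptions, since the actual instructions and templates are those shown in Figures 5 and 6.

```python
# A minimal sketch of unsupervised question-answer generation from one corpus
# fragment. The prompt wording and model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

TEMPLATE = (
    "Based only on the following pest/disease text, generate one question "
    "and its answer as JSON with keys 'question' and 'answer'.\n\nText:\n{passage}"
)

def generate_qa(passage: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder for the LLM API actually used
        messages=[{"role": "user", "content": TEMPLATE.format(passage=passage)}],
        temperature=0.7,
    )
    return response.choices[0].message.content
```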
Due to the extensive number of calls made to the LLMs, even when explicit requirements are outlined in the template, it is inevitable that some generated data may not conform to expected standards. For instance, the produced dataset might contain contextual inaccuracies; certain questions could be irrelevant to the provided information; or there may exist extraneous instructions unrelated to the given data. These issues can result in unprofessional content, as illustrated in Figure 7. Addressing these low-quality data necessitates considerable human effort and poses a risk of inducing hallucination issues within the model if utilized for fine-tuning. This, in turn, can adversely affect both the professionalism and accuracy of the models. Consequently, stringent quality control measures for data are essential to ensure the reliability and effectiveness of the models.
3.2.2. Generation-Evaluation Adversarial Framework
Numerous researchers have proposed various methods aimed at enhancing the quality and diversity of text generation, such as human intervention [35], innovative instruction prompting [36], and dialogue generation language modeling [37]. To tackle the problem of low-quality text generation while managing costs, this study proposes a self-feedback framework based on the model itself: the G-EA framework. This framework is designed to improve both the quality and diversity of data generated by the model. Unlike the aforementioned methods, which rely on external interventions or fixed prompting strategies, the G-EA framework leverages the model’s internal reasoning process to iteratively refine and evaluate the generated content. This self-feedback mechanism enables the model to independently assess and enhance the quality of its output, thus offering a more cost-effective and scalable solution for high-quality text generation. The flowchart is shown in Figure 8. The core principles of the G-EA framework are shown in Figure 9.
In the framework proposed in this study, two phases are introduced: generation and adversarial evaluation. The baseline corpus is segmented into a sequence collection $S = \{s_1, s_2, \dots, s_n\}$, where each $s_i$ represents a fragment of the baseline corpus. Key information $k_i$ is extracted from $s_i$ and replaced with mask tokens to construct the corrupted text $\tilde{s}_i$. Based on $\tilde{s}_i$, the prompt $p$, the prompt template $T$, and the previously generated question fragments $q_{<t}$, questions $Q$ are generated by maximizing the conditional probability:

$$P(Q \mid \tilde{s}_i, p, T) = \prod_{t=1}^{|Q|} P(q_t \mid q_{<t}, \tilde{s}_i, p, T) \tag{1}$$

Based on the question sequence $Q$, the key information $k_i$, and the previously generated answer fragments $a_{<t}$, answers $A$ are generated by maximizing the conditional probability:

$$P(A \mid Q, k_i) = \prod_{t=1}^{|A|} P(a_t \mid a_{<t}, Q, k_i) \tag{2}$$

Key information is extracted from the generated question–answer pair to form the set $K_{\mathrm{gen}}$, which is compared with the key information set $K_{\mathrm{orig}}$ derived from $s_i$. This comparison yields the intersection $K_{\cap} = K_{\mathrm{gen}} \cap K_{\mathrm{orig}}$ and the differences $K_{\mathrm{gen}} \setminus K_{\mathrm{orig}}$ and $K_{\mathrm{orig}} \setminus K_{\mathrm{gen}}$. For each element $u \in K_{\mathrm{gen}} \setminus K_{\mathrm{orig}}$ and $v \in K_{\mathrm{orig}} \setminus K_{\mathrm{gen}}$, semantic similarity is computed as follows:

$$\mathrm{sim}(u, v) = \frac{e_u \cdot e_v}{\lVert e_u \rVert \, \lVert e_v \rVert} \tag{3}$$

where $e_u$ and $e_v$ denote the embedding vectors of $u$ and $v$. The intersection coverage is as follows:

$$C = \frac{|K_{\cap}|}{|K_{\mathrm{orig}}|} \tag{4}$$

The comprehensive score is defined as a weighted combination of the coverage and semantic similarity:

$$S = \alpha C + \beta \, \overline{\mathrm{sim}} \tag{5}$$

where $\alpha$ and $\beta$ are weight parameters that balance the influence of coverage and similarity, and $\overline{\mathrm{sim}}$ is the mean semantic similarity over the compared pairs. Finally, the generated question–answer pairs are evaluated against a threshold $\tau$. If $S$ satisfies the following condition, the generation is considered successful and the generated question–answer pairs are closely related to the content of the baseline corpus:

$$S \geq \tau \tag{6}$$
Otherwise, the generated question–answer pairs require further optimization or regeneration.
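The following sketch illustrates how the adversarial-evaluation scoring of Equations (3)–(6) could be implemented; the embedding model, the weights α and β, and the threshold value are illustrative assumptions rather than the study's actual settings.

```python
# Sketch of the G-EA adversarial-evaluation scoring (Eqs. (3)-(6)). The
# embedding model name, weights, and threshold are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def gea_score(k_generated: set[str], k_original: set[str],
              alpha: float = 0.6, beta: float = 0.4) -> float:
    intersection = k_generated & k_original
    coverage = len(intersection) / len(k_original)            # Eq. (4)
    extra, missing = k_generated - k_original, k_original - k_generated
    if extra and missing:
        sims = [cosine(encoder.encode(u), encoder.encode(v))  # Eq. (3)
                for u in extra for v in missing]
        mean_sim = float(np.mean(sims))
    else:
        mean_sim = 1.0  # no divergent key information to reconcile
    return alpha * coverage + beta * mean_sim                 # Eq. (5)

# Eq. (6): accept the pair only if the score clears the threshold tau (assumed 0.8).
accepted = gea_score({"蚜虫", "吡虫啉"}, {"蚜虫", "吡虫啉", "麦田"}) >= 0.8
```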
3.3. Proposed Method
This section elaborates on the key techniques employed to fine-tune IPM-AgriGPT, as illustrated in Figure 10. Full fine-tuning (FFT) of LLMs typically requires substantial computational resources and a wide range of data corpora. To facilitate a more efficient and practical model development process, this study adopts parameter-efficient fine-tuning (PEFT). Initially, the base model ChatGLM3-6B undergoes pre-training using a base corpus to establish a broad understanding of pest and disease knowledge. In the SFT phase, this study combines chain-of-thought distillation (CoTD) [39,40] with agricultural contextual reasoning. GPT-4 serves as the teacher model, guiding the pre-trained model to perform agricultural contextual reasoning and subsequently generate answers. This process enhances the model’s reasoning abilities, making it focus not only on final answers but also on optimizing multi-step reasoning capabilities. Additionally, during training, a domain-specific question–answer corpus is used, and training is performed based on LoRA (low-rank adaptation), leading to the development of IPM-AgriGPT.
The overall objective of this domain-specific fine-tuning approach is to improve the model’s practical utility and reliability within the agricultural sector, thereby enhancing its performance in specialized professional tasks.
3.3.1. Agricultural Contextual Reasoning CoTD
Knowledge distillation [41,42] is a widely utilized technique in the domain of deep learning, centered on the fundamental concept of training a smaller student model to replicate the behavior and performance of a larger teacher model. This approach effectively reduces the computational resource requirements associated with model deployment, while also enhancing efficiency and preserving performance levels.
Emergent abilities distillation (EAD) is an extension of traditional knowledge distillation, which focuses on transferring emergent abilities exhibited by LLMs. Knowledge distillation techniques grounded in emergent abilities encompass in-context learning (ICL) [43], chain of thought (CoT) [39], and instruction following (IF) [44]. These attributes significantly enhance the performance and accuracy of the models when tackling intricate tasks.
This study proposes ACR-CoTD. Unlike traditional CoT distillation methods, which typically rely on teacher models to add labels and reasoning for fine-tuning question–answer pairs in classification tasks, the LLM developed in this study is specifically designed for generative tasks. The core principle of this method is to optimize the reasoning process so that the student model generates answers as close as possible to the standard answer. Specifically, given a question and its known answer, the goal of this study is to identify the optimal reasoning process that minimizes the discrepancy between the generated answer and the standard answer. In this process, the teacher model (e.g., GPT-4) combines the question and reasoning process to provide the correct answer generation strategy. The student model then imitates the reasoning process of the teacher model, gradually optimizing its own reasoning strategy to improve the accuracy of the generated answers. The teacher model plays a guiding role, providing the correct path for answer generation, while the student model improves its reasoning process through continuous optimization, progressively approximating the standard answer.
This study, based on CoT, aims to derive the optimal reasoning process $r^*$ given a question $q$ and a known answer $a^*$, such that the answer generated from $q$ and $r$ is as close as possible to the known answer $a^*$. To achieve this, the problem is formulated as an optimization problem, where the optimal reasoning process $r^*$, which minimizes the discrepancy between the generated answer and the known answer $a^*$, is determined by the loss function $\mathcal{L}$. The formula is as follows:

$$r^* = \arg\min_{r} \mathcal{L}\big(f(q, r), a^*\big) \tag{7}$$

$\mathcal{L}$ is the loss function that measures the difference between the generated answer and the target answer. $\hat{a}$ represents the answer generated from the question $q$ and reasoning process $r$, defined by the following relation:

$$\hat{a} = f(q, r) \tag{8}$$

This study uses GPT-4 as the teacher model, based on an autoregressive generation task, to generate answers through the question and reasoning process. Therefore, the expanded formula is as follows:

$$\hat{a} = \arg\max_{a} P(a \mid q, r) \tag{9}$$

Each fragment $\hat{a}_t$ is generated based on the previously generated fragments $\hat{a}_{<t}$, the question $q$, and the reasoning process $r$, with the goal of maximizing the likelihood of the correct answer sequence. The probability of generating the answer sequence can be factorized over fragments as follows:

$$P(\hat{a} \mid q, r) = \prod_{t=1}^{|\hat{a}|} P(\hat{a}_t \mid \hat{a}_{<t}, q, r) \tag{10}$$

To obtain the optimal reasoning process $r^*$, this paper optimizes the reasoning process by minimizing the loss function $\mathcal{L}$ through its gradient with respect to $r$. The formula is the following:

$$\nabla_{r} \mathcal{L} = \nabla_{r} \left[ -\sum_{t=1}^{|a^*|} \log P\big(a^*_t \mid a^*_{<t}, q, r\big) \right] \tag{11}$$

By computing this gradient, it is possible to determine how to adjust $r$ in order to minimize the loss function and make the generated answer closer to the target answer $a^*$.
Finally, the parameters of the reasoning process are updated according to the gradient:

$$r_{k+1} = r_k - \eta \, \nabla_{r} \mathcal{L}(r_k) \tag{12}$$

where $\eta$ is the learning rate, controlling the step size of each update, and $r_k$ is the reasoning process at step $k$. This process continues iteratively until the loss function converges to a minimum, yielding the optimal reasoning process $r^*$. To integrate the reasoning process into the downstream training of the base model, we combine $r^*$ with the known answers $a^*$. The newly derived set of answers, denoted as $A^*$, is expressed as follows:

$$A^* = \{\, r_i^* \oplus a_i^* \mid i = 1, \dots, N \,\} \tag{13}$$

where $\oplus$ denotes the concatenation of each reasoning chain with its corresponding answer.
This methodology allows the CoT from GPT-4 to gradually guide the student model in acquiring stepwise reasoning capabilities, ensuring that it is able to both produce a final answer and accurately articulate the reasoning process underlying each question.
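A minimal sketch of this distillation data construction is shown below: the teacher model is prompted for the reasoning chain, which is then concatenated with the known answer as in Equation (13). The prompt wording and output format are assumptions for illustration.

```python
# Sketch of assembling the ACR-CoTD corpus: the teacher's reasoning chain r*
# is concatenated with the known answer a* to form the student's training
# target (Eq. (13)). Prompt text and record format are illustrative.
from openai import OpenAI

client = OpenAI()

def distill_example(question: str, gold_answer: str) -> dict:
    prompt = (
        "You are an agricultural pest-management expert. Given the question "
        "and its correct answer, write the step-by-step reasoning that leads "
        f"to the answer.\nQuestion: {question}\nAnswer: {gold_answer}"
    )
    reasoning = client.chat.completions.create(
        model="gpt-4",  # the teacher model in this study
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    # a_new = [r*; a*]: reasoning chain first, final answer appended.
    return {"instruction": question,
            "output": f"{reasoning}\n最终答案：{gold_answer}"}
```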
3.3.2. SFT Based on Pest and Disease Data
SFT is the process of further fine-tuning a model through supervised learning on a specific task dataset after the pre-training phase. The commonly used objective function is the cross-entropy loss. The model's training objective is to optimize the parameters $\theta$ by minimizing the total loss, which is expressed as follows:

$$\theta^* = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(f_\theta(x_i), y_i\big) \tag{14}$$

where $f_\theta(x_i)$ represents the model's predicted output for the input sample $x_i$, and $y_i$ is the true label for that sample. By comparing the predicted value with the true label, the total loss is minimized, and the model parameters are updated by averaging the loss over all samples. The loss function is specifically represented as follows:

$$\mathcal{L}\big(f_\theta(x_i), y_i\big) = -\sum_{t=1}^{|y_i|} \log P_\theta\big(y_{i,t} \mid y_{i,<t}, x_i\big) \tag{15}$$

By adjusting the parameters $\theta$, the model optimizes the predicted result $f_\theta(x_i)$, minimizing the difference between the generated answer and the true label $y_i$. This process, by minimizing the negative log-likelihood, progressively improves the model parameters, enabling the model to better adapt to the task-specific data distribution and thereby enhancing its accuracy and performance.
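In code, this objective corresponds to the standard token-level negative log-likelihood used for causal language model fine-tuning; the sketch below, with illustrative tensor shapes, mirrors Equations (14) and (15).

```python
# Minimal sketch of the token-level negative log-likelihood of Eqs. (14)-(15)
# as computed in causal-LM fine-tuning. Tensor shapes are illustrative.
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, vocab); labels: (batch, seq_len)."""
    # Shift so that each position predicts the next token, as in causal LMs.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # positions masked out of the loss (e.g., prompt tokens)
    )
```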
3.3.3. LoRA
Low-rank adaptation (LoRA) [21] is an efficient parameter fine-tuning method that modifies the weights of a pre-trained model through low-rank decomposition to align with the requirements of specific tasks or domains. The fundamental principle underlying LoRA is to preserve most of the pre-trained weights while only adjusting a small subset of the model's parameters, thereby minimizing computational and storage overhead during the fine-tuning process. The core equation is the following:

$$W = W_0 + \Delta W = W_0 + BA \tag{16}$$

where $W_0 \in \mathbb{R}^{d \times k}$ denotes the initial weight matrix, which remains fixed throughout the fine-tuning process and does not undergo gradient updates. The weight update component $\Delta W$ is represented by two low-rank matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$:

$$\Delta W = BA, \quad r \ll \min(d, k) \tag{17}$$

Matrix $A$ is initialized randomly using a Gaussian distribution, whereas matrix $B$ starts with zero values. Through this low-rank structure, LoRA facilitates efficient fine-tuning with minimal resource demands.
In the context of optimizing LoRA for a downstream task, the optimization objective can be expressed as follows:

$$\max_{\Theta} \sum_{(x, y) \in Z} \sum_{t=1}^{|y|} \log P_{\Phi_0 + \Delta\Phi(\Theta)}\big(y_t \mid x, y_{<t}\big) \tag{18}$$

In this equation, $Z$ denotes the training dataset, $\Theta$ represents the parameters of the low-rank matrices $A$ and $B$, $\Phi_0$ is the fixed pre-trained weight matrix, and $\Phi_0 + \Delta\Phi(\Theta)$ is the weight applied to the model during fine-tuning. By using low-rank decomposition, LoRA reduces the number of parameters that need to be updated, making the fine-tuning process more computationally efficient, particularly when dealing with LLMs. This allows LoRA to adapt the model to specific tasks or domains without requiring extensive resources, making it a practical solution for resource-constrained environments.
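The update rule of Equations (16) and (17) can be expressed compactly as a wrapper around a frozen linear layer; the following sketch uses the standard LoRA initialization (Gaussian for A, zeros for B) with an illustrative scaling factor.

```python
# Sketch of the LoRA update in Eqs. (16)-(17): W0 stays frozen, and only the
# low-rank factors A (Gaussian init) and B (zero init) receive gradients.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # W0 is frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r  # common LoRA convention

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x = W0 x + (B A) x, where Delta-W = B A has rank r.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```

Because B starts at zero, the adapted model is exactly the pre-trained model at the first step, and the low-rank update grows only as training proceeds.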
3.4. Experimental Setup
3.4.1. Hardware Platform
In terms of hardware configuration, this study employed a single NVIDIA A800-80 GB GPU, which features 80 GB of VRAM and is capable of efficiently managing large-scale model training tasks. The server was equipped with a 14-core Intel(R) Xeon(R) Gold 6348 processor operating at 2.60 GHz, complemented by 100 GB of RAM to provide sufficient resources for data processing and complex computations. The operating system utilized was Ubuntu 22.04, while Python 3.10 served as the primary programming language. The deep learning framework selected for this research was PyTorch version 2.1.0, which utilized CUDA version 12.1 to enable efficient GPU computation and enhance training performance. Furthermore, the Transformers [45] library and PEFT [46] library were employed for model fine-tuning, and DeepSpeed was integrated to enhance the training process further, ensuring optimal performance when handling large-scale datasets.
3.4.2. Base Model
In the development of IPM-AgriGPT, this study selected ChatGLM3-6B as the foundational model. The primary rationale for selecting ChatGLM3-6B as the base model stems from its moderate parameter size, which facilitates efficient training under resource-constrained conditions. With its 6 billion parameters, ChatGLM3-6B leverages extensive training data and optimized training strategies, excelling in tasks such as Chinese semantic processing, accurate answering, natural dialogue generation, and complex information retrieval. As of 27 October 2023, ChatGLM3 has been evaluated on eight representative datasets in both Chinese and English languages, achieving scores that surpass those of other pre-trained models with fewer than 10 billion parameters within these datasets [47]. Consequently, choosing ChatGLM3-6B as the foundational model for IPM-AgriGPT not only meets the requirements of this study for applications in Chinese pest and disease management but also guarantees professionalism and accuracy in the generated content.
3.4.3. Pre-Training
To establish robust foundational understanding and generation capabilities within the agricultural pest and disease domain, this study utilized a pest and disease corpus containing 5704 text entries for pre-training the base model [48]. During the pre-training phase, various hyperparameters were carefully configured to enhance model performance. To accommodate computational resource limitations, the batch size was set to 4. Furthermore, an initial learning rate of 2 × 10⁻⁴ was established, accompanied by a cosine annealing schedule for optimization.
3.4.4. Fine-Tuning
In this study, the question–answer dataset was divided into training and test sets in an 8:2 ratio. During the training phase, the batch size was set to 4, with an initial learning rate of 5 × 10⁻⁴. A cosine annealing strategy was employed to ensure the model's stability and facilitate gradual convergence throughout training. The entire training procedure was scheduled to run for 30 epochs. Additionally, the LoRA-specific configuration was as follows: type set to LoRA, r set to 8, alpha set to 32, and dropout set to 0.1. For evaluation purposes, this study configured the evaluation batch size per device to 16 and adopted a step-based evaluation strategy. Model performance was assessed every 500 steps to enable real-time monitoring during training. If no significant improvement was observed across consecutive evaluations, training was halted automatically.
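Using the PEFT library referenced in Section 3.4.1, the LoRA settings above can be expressed roughly as follows; the target_modules value is an assumption about ChatGLM3's attention projection layer rather than a configuration reported in this study.

```python
# Sketch of the fine-tuning configuration described above, via the PEFT
# library. r, alpha, and dropout are the values reported in this section;
# target_modules is an assumption for ChatGLM3's fused QKV projection.
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModel

base_model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["query_key_value"],  # assumed module name
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # confirms only LoRA weights are trainable
```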
3.5. Evaluation Metrics
This section delineates the evaluation methodology and presents the corresponding results along with relevant benchmarks for the IPM-AgriGPT model developed in this study. To validate the effectiveness of the fine-tuning process, several baseline models were selected for comparative analysis against the fine-tuned model. Given the specificity inherent to vertical domain models, benchmarks designed for LLMs often inadequately assess their performance within specialized domains. Consequently, this study specifically devised an evaluation benchmark that encompasses agricultural pest and disease knowledge to facilitate a comparison between the fine-tuned and non-fine-tuned models, thereby providing a more precise assessment of model performance in the context of agricultural pests and diseases.
To construct this benchmark, we collected a diverse range of content from multiple sources to ensure it is specifically tailored to the domain of pest and disease management. These sources include agricultural news articles, real-world case reports, and professional plant protection examination papers. The benchmark dataset covered a wide variety of agricultural pest and disease scenarios, encompassing different crops, pest species, and environmental conditions.
To provide a comprehensive evaluation, the benchmark established in this research focuses on three key aspects [25]: professionalism, safety, and effectiveness. Professionalism: this aspect assesses the scientific accuracy and correctness of the knowledge reflected in the recommendations of the models, ensuring that diagnostic and control solutions are grounded in reliable plant pathology and entomology principles. Safety: this dimension evaluates whether proposed solutions pose any adverse effects on environmental integrity or crop safety during application, ensuring that recommendations from the model achieve effective control without compromising product quality or ecological sustainability. Effectiveness: This criterion examines proposed control methods to ensure they are both safe and practical in achieving desired outcomes. It includes considerations such as appropriate pesticide concentrations, application intervals, and preventive measures tailored to field management practices while aligning with specific agricultural contexts.
3.5.1. Baseline
To conduct a comparative analysis, this study selected several LLMs with comparable parameter sizes that have demonstrated robust performance in processing Chinese as baselines. These include ChatGLM3-6B, Baichuan2-7B, and Chinese-llama2-7B. With parameter sizes of 6 billion (6B) and 7 billion (7B), these models have excelled in various Chinese NLP tasks, making them ideal candidates for this comparative study. Furthermore, GPT-3.5-Turbo serves as a commercial LLM reference point, providing a benchmark for real-world application scenarios. By comparing these advanced baseline models with IPM-AgriGPT, this study aims to gain deeper insights into the capabilities of the latter within specific domains while also exploring the performance differences among the various models when addressing these tasks. This approach will facilitate the optimization and adjustment of algorithms to ensure high-quality recommendations in professional and practical contexts.
3.5.2. Objective Task
This study developed 600 objective tasks, each pertaining to professionalism, safety, and effectiveness. Each task is presented as a multiple-choice question with four options, of which only one is correct. During the evaluation process, prompts were designed to guide the LLMs in selecting an answer from the provided options. These prompts included not only the problem description and the four candidate answers but also leveraged the generative capabilities of the models to identify the most likely correct response. Accuracy served as the primary metric for evaluation. The testing employed a 5-shot approach, wherein the model was first presented with five solved example questions before attempting each new question. This method facilitates in-context learning, allowing the model to better comprehend the questions and make accurate selections [49]. Accuracy in the 5-shot setting is calculated using the following formula:

$$\text{Accuracy} = \frac{N_{\text{correct}}}{N_{\text{total}}} \tag{19}$$

where $N_{\text{correct}}$ is the number of correctly answered questions and $N_{\text{total}}$ is the total number of questions.
To comprehensively assess performance across various tasks, accuracy scores for different task types were calculated to derive an overall performance score for safety, effectiveness, and professionalism.
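A simplified sketch of the 5-shot evaluation loop is given below; the exemplar formatting is an assumption, while the accuracy computation follows Equation (19).

```python
# Sketch of the 5-shot objective evaluation: five solved exemplars are
# prepended to each new question, and accuracy follows Eq. (19). The
# Chinese prompt formatting is an illustrative assumption.
def build_prompt(exemplars: list[dict], question: dict) -> str:
    parts = []
    for ex in exemplars[:5]:  # exactly five in-context examples
        parts.append(f"问题：{ex['question']}\n选项：{ex['options']}\n答案：{ex['answer']}")
    parts.append(f"问题：{question['question']}\n选项：{question['options']}\n答案：")
    return "\n\n".join(parts)

def accuracy(predictions: list[str], gold: list[str]) -> float:
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)  # Eq. (19)
```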
3.5.3. Subjective Task
This study developed 200 subjective tasks centered on safety, effectiveness, and professionalism, necessitating the model to generate detailed responses. Specifically designed prompts were employed to guide the LLMs in producing relevant, thorough, and precise answers to the questions posed. Following the generation of responses by the model, GPT-4 was utilized to evaluate these answers. GPT-4 was selected as the evaluation tool due to its hundreds of billions of parameters, which provide exceptional semantic understanding, reasoning, and generation capabilities, particularly in multi-lingual contexts. Through meticulous analysis of textual grammar, semantics, and context, GPT-4 effectively evaluates the accuracy and logic of generated content, thereby ensuring the reliability of subjective task assessments. Furthermore, the extensive pre-training of GPT-4 equips it with a high degree of generalization across various tasks, enabling it to adapt to specialized knowledge from diverse domains and guaranteeing that the evaluation process remains both thorough and rigorous. In the evaluation process, this study referenced established evaluation standards [50,51]. To determine whether the generated responses could be deemed acceptable by users, the assessment centered on six primary criteria: coherence, logical consistency, fluency, relevance, comprehensibility, and sufficiency. The prompts used for evaluation, scoring criteria, and score calculation methods are illustrated in Figure 11, Figure 12 and Figure 13.
Additionally, this study conducted an independent evaluation of the accuracy of the responses to ascertain whether the key information provided was correct, as illustrated in Figure 14.
The standard correct answers were compared with the responses generated by the model, and GPT-4 was utilized to assess this comparison, thereby evaluating the performance of the models in terms of accuracy. Ultimately, the final score for each model was calculated as a weighted combination of the seven evaluation criteria. The formula is as follows:

$$S_{\text{final}} = \sum_{i=1}^{6} w_i S_i + w_7 S_7 \tag{20}$$

where $S_1, \dots, S_6$ represent the scores for the six criteria (coherence, logical consistency, fluency, relevance, comprehensibility, and sufficiency), $S_7$ represents the score for accuracy, and $w_1, \dots, w_7$ are the corresponding weights. Through this multidimensional weighted scoring system, GPT-4 can make more precise assessments of each model's performance in terms of safety, effectiveness, and professionalism. This approach objectively validates the relevance of the models in the field of agricultural pest and disease management, enabling this study to gain a deeper understanding of the models' generative capabilities and potential for real-world applications.
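The weighted aggregation of Equation (20) reduces to a dot product between scores and weights; the sketch below uses illustrative weight values, as the study's actual weights are not restated here.

```python
# Sketch of the weighted aggregation in Eq. (20). The weight values are
# illustrative assumptions; the study assigns its own weights per criterion.
CRITERIA = ["coherence", "consistency", "fluency",
            "relevance", "comprehensibility", "sufficiency", "accuracy"]

def final_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights should sum to 1"
    return sum(weights[c] * scores[c] for c in CRITERIA)

# Example with assumed weights: six language-quality criteria at 0.1 each,
# accuracy weighted at 0.4 to reflect its separate evaluation.
example_weights = {c: 0.1 for c in CRITERIA[:6]} | {"accuracy": 0.4}
```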
4. Results
4.1. Evaluation Results
From Table 2, it is evident that all models demonstrated superior performance in safety-related tasks, followed by professionalism, and finally effectiveness. This trend suggests that current LLMs exhibit strong adaptability when managing safety-related issues yet show slightly reduced competence in professionalism and relatively weaker performance in addressing the effectiveness of practical application tasks. Specifically, GPT-3.5-Turbo performed exceptionally well in safety with a score of 0.7230, achieving scores of 0.6100 in professionalism and 0.4990 in effectiveness, resulting in an average score of 0.6110. Baichuan2-7B scored 0.6490 in safety, 0.5920 in professionalism, and 0.4110 in effectiveness, for an average score of 0.5510, demonstrating particularly strong performance in safety-related tasks. IPM-AgriGPT showed balanced results, scoring 0.6360 in safety, 0.5870 in professionalism, and 0.3920 in effectiveness, with an average score of 0.5380, indicating good adaptability across specialized domains. Conversely, ChatGLM3-6B and Chinese-llama-7B displayed weaker performance, with average scores of 0.4630 and 0.3570, respectively. ChatGLM3-6B scored 0.5800 in safety, 0.4940 in professionalism, and only 0.3150 in effectiveness, while Chinese-llama-7B obtained scores of 0.4420 in safety, 0.3740 in professionalism, and 0.2560 in effectiveness—highlighting significant shortcomings in both professionalism and effectiveness when handling practical application tasks.
According to Table 3, GPT-3.5-Turbo exhibited the most superior performance across all tasks, particularly demonstrating remarkable stability and adaptability in various aspects of language quality. Baichuan2-7B closely followed, with only minor discrepancies in most metrics when compared to GPT-3.5-Turbo, showcasing robust overall capabilities. The fine-tuned IPM-AgriGPT displayed comparable or slightly enhanced performance relative to ChatGLM3-6B, suggesting that the fine-tuning process did not significantly compromise language quality. However, regarding sufficiency, IPM-AgriGPT received a slightly lower score than ChatGLM3-6B; nonetheless, this difference remained within an acceptable range. In contrast, Chinese-llama-7B ranked the lowest across all dimensions, particularly with the lowest scores in coherence and sufficiency, highlighting its limitations in handling complex tasks.
According to the results presented in Table 4, GPT-3.5-Turbo performed the best across tasks related to professionalism, safety, and effectiveness, with the fine-tuned IPM-AgriGPT showing clear accuracy gains over the other open-source models. Baichuan2-7B exhibited good performance across tasks, with scores of 4.1705 in professionalism, 4.2525 in safety, and 4.1920 in effectiveness. IPM-AgriGPT outperformed Baichuan2-7B, achieving scores of 4.3105 in professionalism, 4.3505 in safety, and 4.2395 in effectiveness, all exceeding Baichuan2-7B's corresponding scores. ChatGLM3-6B and Chinese-Llama-7B had relatively lower accuracy scores, with Chinese-Llama-7B showing significant limitations in handling complex tasks.
The results presented in Table 5 indicate that GPT-3.5-Turbo achieved the highest accuracy score across the three task categories, attaining a score of 4.5087. This demonstrates its exceptional language quality and accuracy compared to all other models evaluated. Following closely was IPM-AgriGPT, which recorded an average score of 4.4492, reflecting high levels of professionalism and reliability after fine-tuning within the agricultural pest and disease domain. IPM-AgriGPT’s scores for professionalism, safety, and effectiveness were 4.4566, 4.4526, and 4.4386, respectively; these figures slightly exceeded those of Baichuan2-7B, which scored 4.3993 for professionalism, 4.4368 for safety, and 4.4345 for effectiveness. This highlights the significant enhancements made by the fine-tuned model in specific domain tasks. In contrast, ChatGLM3-6B achieved an average score of 4.3731; while it demonstrated relatively stable performance across tasks, its scores were lower than those of both IPM-AgriGPT and Baichuan2-7B.
Chinese-Llama-7B exhibited the lowest overall scores, with an average of only 3.9498; particularly notable was its professionalism score of just 3.7674, which was significantly inferior to those obtained by the other models assessed in this study, reflecting its limitations in managing complex tasks effectively and in achieving high accuracy. As a fine-tuned model specifically designed for agricultural applications, IPM-AgriGPT displayed strong professional adaptability, with slight improvements over Baichuan2-7B while significantly outperforming both ChatGLM3-6B and Chinese-Llama-7B.
4.2. Ablation Experiment Results
As illustrated in Table 6, IPM-AgriGPT demonstrated superior performance compared to other models that were fine-tuned using various strategies across both objective and subjective tasks. This indicates a significant enhancement attributable to the G-EA and distillation fine-tuning methodologies. In terms of objective tasks, IPM-AgriGPT achieved scores of 0.5870, 0.6360, and 0.3920 for professionalism, safety, and effectiveness, respectively, resulting in an average score of 0.5380—surpassing both ChatGLM3-6B-Tuned and ChatGLM3-6B-G-EA. The ChatGLM3-6B-Tuned model was fine-tuned without employing G-EA or distillation techniques, leading to relatively lower scores. Conversely, ChatGLM3-6B-G-EA utilized the corpus generated under the G-EA framework and exhibited some improvement across all objective tasks when compared to ChatGLM3-6B-Tuned; however, it still fell short of matching the performance of IPM-AgriGPT. In subjective tasks as well, IPM-AgriGPT excelled in professionalism, safety, and effectiveness with scores of 4.4566, 4.4526, and 4.4386, respectively—culminating in an average score of 4.4492 that outperformed both ChatGLM3-6B-Tuned and ChatGLM3-6B-G-EA models. While ChatGLM3-6B-G-EA showed some advancements in subjective task evaluations—particularly regarding safety and professionalism—it remained behind IPM-AgriGPT overall. With the integration of G-EA alongside distillation optimization strategies, IPM-AgriGPT significantly improved its adaptability and performance within this specific domain while particularly excelling in professionalism and safety-related tasks.
5. Discussion
5.1. Results Analysis
The experimental results indicate that IPM-AgriGPT outperforms the baseline model ChatGLM3-6B across all dimensions of the objective tasks, demonstrating that fine-tuning can significantly enhance a model's performance in specific domains. The fine-tuned IPM-AgriGPT achieved scores comparable to those of Baichuan2-7B in professionalism and safety tasks, particularly excelling in its ability to understand and process specialized knowledge. The fine-tuning process markedly improved the model's accuracy and practicality. In subjective tasks, Baichuan2-7B and GPT-3.5-Turbo, with 7 billion or more parameters, achieved the highest scores across the six language quality indicators. This outcome highlights the advantages of larger-parameter models in language generation and understanding. Although IPM-AgriGPT has a slightly lower parameter count, it still outperformed ChatGLM3-6B in the first five language quality indicators; however, it received a marginally lower score in the sufficiency indicator. The difference is not substantial and remains within an acceptable range. This phenomenon may be attributed to the robust language-understanding capabilities of the base model ChatGLM3-6B. During the fine-tuning process, some original weights were overwritten by the generative weights of IPM-AgriGPT, resulting in a "Catastrophic Forgetting (CF)" effect [52]. Nevertheless, during this fine-tuning phase, IPM-AgriGPT utilized GPT-4 as a teacher model and was trained under CoT guidance. This approach mitigated any decline in linguistic capabilities and ensured that IPM-AgriGPT maintained high performance standards in professional tasks after fine-tuning. In contrast, Chinese-Llama-7B recorded the lowest scores across all six language quality indicators. This result likely stems from its derivation from LLaMA combined with fine-tuning on Chinese data; however, LLaMA itself was not optimized for a Chinese context. Consequently, responses generated by this model frequently exhibit knowledge confusion and repetition (as illustrated in Figure 15 and Figure 16), indicating that it still faces limitations when processing Chinese text.
In terms of accuracy metrics, IPM-AgriGPT exhibited significant advantages following fine-tuning, achieving scores that were significantly higher than those of Baichuan2-7B, ChatGLM3-6B, and Chinese-llama-7B. Models that lacked fine-tuning with domain-specific knowledge often produced responses that seemed reasonable but displayed considerable deviations and errors when compared to standard answers, leading to instances of “hallucinations” [53], as illustrated in Table 7. Through targeted fine-tuning within the domain of agricultural pest and disease management, IPM-AgriGPT was able to generate responses that were consistent with professional knowledge, thereby enhancing accuracy in specific tasks. This advantage is particularly pronounced in tasks requiring a high level of expertise, thus improving the model’s applicability in real-world scenarios. In the ablation experiments, models fine-tuned using low-quality data encountered issues related to knowledge confusion and repetitive generation. The primary reason for this phenomenon is that low-quality data may contain incorrect, incomplete, or redundant information; such deficiencies lead the model to acquire inaccurate knowledge during fine-tuning processes. Consequently, this results in inconsistencies and challenges in establishing a coherent knowledge structure. As a result, the generated content frequently reflects both confusion regarding knowledge and repetition. Furthermore, when token length restrictions were not imposed, there was a marked increase in both generation time and content length—even though prompts explicitly specified limiting responses to 300 words—indicating a decline in the model’s capacity for language comprehension tasks. For models subjected to fine-tuning utilizing the G-EA framework-generated corpus without subsequent distillation procedures, while they retained most language comprehension capabilities inherent to the base model, they demonstrated insufficient reasoning concerning questions posed. This limitation resulted in responses that were often brief and inadequately addressed the inquiries presented.
Despite having fewer parameters compared to Baichuan2-7B and GPT-3.5-Turbo, the domain-specific fine-tuning of IPM-AgriGPT has significantly enhanced its adaptability, particularly excelling in agricultural pest and disease management. The experimental results indicate that the effectiveness of the models is not solely determined by the number of parameters; rather, it is largely influenced by the efficacy and appropriateness of the fine-tuning strategy employed. The performance of IPM-AgriGPT demonstrates that effective fine-tuning within a specific domain can yield high-quality, accurate responses even with a smaller parameter size, thereby showcasing a substantial advantage in specialized applications.
5.2. Limitation
During the data construction phase, the question–answer corpus for IPM-AgriGPT was generated using a general-purpose LLM based on the G-EA framework. While this method effectively reduces the occurrence of low-quality data, it can only mitigate, rather than entirely eliminate, low-quality question–answer pairs: when generating large numbers of pairs, a small fraction of low-quality ones may still appear. Therefore, this approach may not completely eliminate the need for manual inspection and guidance by domain experts during data generation, and the G-EA framework still requires further optimization of its evaluation phase. Additionally, although IPM-AgriGPT outperforms other benchmark models with similar parameter sizes in the final results, its performance on some objective and subjective tasks, as measured by the six generalization metrics, still lags behind that of a benchmark model with a larger parameter size. This highlights the limitations of smaller-parameter models in language comprehension. Therefore, training with larger-parameter models will be key to enhancing IPM-AgriGPT's ability to solve specialized pest and disease issues. Furthermore, IPM-AgriGPT's effectiveness score is significantly lower than its scores in the other two aspects, reflecting insufficient training data in certain areas. To address this, it is crucial to collect more knowledge in specific agricultural domains, such as geography, meteorology, and crop growth stages, as these context-specific datasets will play a significant role in improving IPM-AgriGPT's ability to tackle pest- and disease-related problems.
5.3. Future Work
The experimental results indicate that IPM-AgriGPT exhibits remarkable adaptability and professionalism in the domain of pest and disease knowledge, particularly excelling in tasks related to professionalism and safety. However, the model faces limitations when new pest and disease knowledge, not included in the previous fine-tuning data, becomes available, as it requires re-fine-tuning. This process consumes significant computational resources and may reduce the model's language comprehension, as new weights overwrite existing ones. In addition, resource constraints in practical agricultural production could restrict the model's widespread use. Future research should focus on optimizing storage and computational requirements to make the system compatible with edge computing devices. One promising approach is to explore retrieval-augmented generation (RAG) [55] technology, which could allow dynamic knowledge updates without frequent fine-tuning, ensuring the model's knowledge remains up to date. Another direction is integrating multimodal data [56]—such as soil data, weather data, and images—which can significantly enhance IPM-AgriGPT's ability to address a wider range of agricultural challenges, including crop growth monitoring, soil health analysis, and environmental impact assessment. Additionally, by combining these data with pest and disease management, the model can not only improve the precision of pest detection and control but also enable more personalized pest management strategies tailored to specific regions, crop varieties, and environmental conditions. Multimodal data will make IPM-AgriGPT more flexible and adaptive in agricultural management, ultimately enhancing the efficiency, sustainability, and environmental friendliness of agricultural production.
Although IPM-AgriGPT was specifically designed for pest and disease management in the agricultural domain using a Chinese LLM, the underlying method fundamentally aims to improve general LLMs' ability to solve specialized domain problems. Therefore, when addressing agricultural pest and disease management in regions that speak low-resource languages, the key challenge is to collect large-scale pest and disease data in those languages and to select a general pre-trained model that can effectively understand them. Existing LLMs are primarily developed for high-resource languages: ChatGPT and Llama focus on English, while ChatGLM, Baichuan, and Qwen focus on Chinese. However, Chinese and English together cover only about 60% of the world's population, and many low-resource languages have large user bases and important use cases [57]. Therefore, building LLMs for low-resource languages will be a key direction for future research.
6. Conclusions
This paper focuses on pest and disease information management and intelligent question answering, proposing a Chinese LLM specifically tailored for this domain: IPM-AgriGPT. By incorporating the G-EA framework and ACR-CoTD techniques, the model effectively integrates expertise in pest and disease management, significantly enhancing the adaptability and accuracy of LLMs in specialized agricultural tasks. Experimental results demonstrate that IPM-AgriGPT excels in multiple dimensions, including professionalism, safety, and effectiveness, with comprehensive evaluations indicating its superior performance compared to other LLMs. Compared to the base model, the fine-tuned IPM-AgriGPT generates more accurate responses to complex pest- and disease-related inquiries. For objective tasks, the professionalism, safety, and effectiveness scores are 0.5870, 0.6360, and 0.3920, respectively, whereas for subjective tasks they are 4.4566, 4.4526, and 4.4386.
The primary contribution of this study lies in the innovative application of LLMs to the field of pest and disease control, offering an intelligent and practical solution for agricultural pest management. Additionally, a dedicated question–answer corpus and an optimized generation framework were developed, which not only improve the model’s accuracy but also provide valuable references for future research, particularly in the intelligent application of unstructured agricultural data. Future research will continue to explore more adaptive fine-tuning strategies and develop more efficient generation frameworks to further enhance the practicality and broad applicability of IPM-AgriGPT, thereby providing robust technical support for sustainable agricultural development.
Author Contributions: Conceptualization, Y.Z., M.L. and L.G.; methodology, Y.Z.; software, Y.Z. and Q.F.; validation, Q.F. and F.L.; formal analysis, F.L. and X.C.; investigation, Y.Z.; resources, M.L. and L.G.; data curation, Q.F. and X.C.; writing—original draft preparation, Y.Z. and Z.Z.; writing—review and editing, L.G. and Z.Z.; visualization, Q.F. and X.C.; supervision, F.L., L.G. and Z.Z.; project administration, F.L. and M.L.; funding acquisition, L.G. All authors have read and agreed to the published version of the manuscript.
Data Availability Statement: The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy and ethical restrictions.
Conflicts of Interest: The authors declare no conflicts of interest.
Figure 6. Templates used for generating the question-answer corpus in the self-instruction framework.
Figure 8. Flowchart of the G-EA framework, which is divided into two stages: generation and evaluation. The framework's foundational technology is the autoregressive blank infilling objective [38]. In the generation stage, the basic corpus is first divided into multiple sequences and key information is extracted from each sequence; the remaining sequence content, with the key information removed, is treated as "corrupted text". These corrupted sequences, along with prompts and templates, are used to generate questions, and the extracted key information, questions, and prompts are then employed to generate the corresponding answers. In the evaluation stage, the key information extracted from the generated question–answer pairs is compared with the key information from the original corpus. If the similarity between the two sets of key information exceeds a predefined threshold, the generated question–answer pairs are considered high quality; otherwise, they are regarded as low quality.
Figure 9. The core principles of the G-EA framework, highlighting the key steps involved in both the generation and evaluation processes.
Figure 10. Overall architecture of IPM-AgriGPT as constructed in this study. The system integrates several techniques, including LoRA-based SFT and ACR-CoTD.
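For readers unfamiliar with the LoRA-based SFT component of Figure 10, the sketch below shows how such an adapter setup is typically configured with the Hugging Face peft library. The checkpoint id, rank, and target modules are assumptions for illustration; the paper's actual hyperparameters may differ.

```python
# A sketch, not the authors' exact setup: typical LoRA-based SFT configuration.
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base = "THUDM/chatglm3-6b"  # ChatGLM3-6B serves as the base model in the ablations
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModel.from_pretrained(base, trust_remote_code=True)

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                 # low-rank dimension (assumed)
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["query_key_value"],  # fused QKV projection in ChatGLM blocks
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapters train; the base stays frozen

# The adapters would then be fine-tuned on the G-EA question-answer corpus
# augmented with ACR-CoTD rationales distilled from a teacher model.
```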
Figure 11. The evaluation prompts for subjective tasks primarily focus on six fundamental criteria.
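As a rough illustration of the GPT-4-as-evaluator protocol used for the subjective-task tables below, this sketch scores a single answer on the six criteria of Figure 11. The prompt wording, the 1-5 scale handling, and the model id are assumptions; only the criterion names come from the paper.

```python
# A hedged sketch of GPT-4-as-evaluator scoring; prompt text and model id
# are assumptions, not the paper's published evaluation prompts.
from openai import OpenAI

CRITERIA = ["coherence", "consistency", "fluency",
            "relevance", "comprehensibility", "exhaustiveness"]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def score(question: str, answer: str, criterion: str) -> float:
    """Ask the evaluator model for a single numeric rating."""
    prompt = (
        f"Rate the following answer to an agricultural pest-management "
        f"question for {criterion} on a 1-5 scale. Reply with the number only.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(reply.choices[0].message.content.strip())

def evaluate(question: str, answer: str) -> dict[str, float]:
    """Score one answer on all six criteria."""
    return {c: score(question, answer, c) for c in CRITERIA}
```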
Data sources and quantities.
Data Source | Quantity |
---|---|
Web | 2498 |
Database | 2833 |
Text | 373 |
Evaluation results based on the objective tasks, with the arrow (↑) indicating that a higher value corresponds to better performance.
Category | ChatGLM3-6B | Baichuan2-7B | Chinese-llama-7B | GPT-3.5-Turbo | IPM-AgriGPT |
---|---|---|---|---|---|
Professionalism ↑ | 0.4940 | 0.5920 | 0.3740 | 0.6100 | 0.5870 |
Safety ↑ | 0.5800 | 0.6490 | 0.4420 | 0.7230 | 0.6360 |
Effectiveness ↑ | 0.3150 | 0.4110 | 0.2560 | 0.4990 | 0.3920 |
Average | 0.4630 | 0.5510 | 0.3570 | 0.6110 | 0.5380 |
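A small sanity check on the table above: the Average row appears to be the unweighted mean of the three category scores, agreeing to three decimal places.

```python
# Quick check that the Average row is the unweighted mean of the three
# category scores, rounded to three decimals (the table pads a trailing zero).
rows = {
    "ChatGLM3-6B":      (0.4940, 0.5800, 0.3150),
    "Baichuan2-7B":     (0.5920, 0.6490, 0.4110),
    "Chinese-llama-7B": (0.3740, 0.4420, 0.2560),
    "GPT-3.5-Turbo":    (0.6100, 0.7230, 0.4990),
    "IPM-AgriGPT":      (0.5870, 0.6360, 0.3920),
}
for name, cats in rows.items():
    print(f"{name}: {sum(cats) / len(cats):.3f}")
# prints 0.463, 0.551, 0.357, 0.611, 0.538, matching the Average row
```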
Evaluation results on the subjective tasks for the six indicators, scored by GPT-4, with the arrow (↑) indicating that a higher value corresponds to better performance.
Index | Category | ChatGLM3-6B | Baichuan2-7B | Chinese-llama-7B | GPT-3.5-Turbo | IPM-AgriGPT |
---|---|---|---|---|---|---|
Coherence ↑ | Professionalism | 4.4060 | 4.4475 | 4.2510 | 4.4625 | 4.4120 |
 | Safety | 4.3400 | 4.4695 | 4.3250 | 4.4810 | 4.4115 |
 | Effectiveness | 4.4225 | 4.4640 | 4.0130 | 4.4675 | 4.4595 |
Consistency ↑ | Professionalism | 4.4245 | 4.4655 | 4.3352 | 4.4755 | 4.5330 |
 | Safety | 4.4420 | 4.4745 | 4.2016 | 4.4915 | 4.4450 |
 | Effectiveness | 4.4195 | 4.4855 | 4.1235 | 4.5715 | 4.5295 |
Fluency ↑ | Professionalism | 4.4295 | 4.5565 | 4.3520 | 4.5745 | 4.4520 |
 | Safety | 4.3870 | 4.4360 | 4.3947 | 4.5955 | 4.3975 |
 | Effectiveness | 4.4645 | 4.5825 | 4.4462 | 4.6090 | 4.4680 |
Relevance ↑ | Professionalism | 4.7620 | 4.6250 | 4.8452 | 4.9320 | 4.8595 |
 | Safety | 4.7010 | 4.8350 | 4.7741 | 4.9245 | 4.7860 |
 | Effectiveness | 4.7755 | 4.8625 | 4.6530 | 4.7815 | 4.8850 |
Comprehensibility ↑ | Professionalism | 4.6515 | 4.6650 | 4.4211 | 4.6285 | 4.6540 |
 | Safety | 4.5575 | 4.5975 | 4.3901 | 4.6300 | 4.5735 |
 | Effectiveness | 4.5620 | 4.6490 | 4.4355 | 4.6695 | 4.5700 |
Exhaustiveness ↑ | Professionalism | 4.4360 | 4.5510 | 4.1058 | 4.6497 | 4.4130 |
 | Safety | 4.5275 | 4.5459 | 4.0064 | 4.5640 | 4.5105 |
 | Effectiveness | 4.5105 | 4.5335 | 3.9528 | 4.5535 | 4.5155 |
Evaluation results on the subjective tasks for the accuracy metric, scored by GPT-4, with the arrow (↑) indicating that a higher value corresponds to better performance.
Category | ChatGLM3-6B | Baichuan2-7B | Chinese-llama-7B | GPT-3.5-Turbo | IPM-AgriGPT |
---|---|---|---|---|---|
Professionalism ↑ | 4.1287 | 4.1705 | 3.4297 | 4.3120 | 4.3105 |
Safety ↑ | 4.2025 | 4.2525 | 3.6750 | 4.3995 | 4.3505 |
Effectiveness ↑ | 4.1620 | 4.1920 | 3.0125 | 4.3380 | 4.2395 |
Overall evaluation results on the subjective tasks, scored by GPT-4, with the arrow (↑) indicating that a higher value corresponds to better performance.
Category | ChatGLM3-6B | Baichuan2-7B | Chinese-llama-7B | GPT-3.5-Turbo | IPM-AgriGPT |
---|---|---|---|---|---|
Professionalism ↑ | 4.3624 | 4.3993 | 3.7674 | 4.4971 | 4.4566 |
Safety ↑ | 4.3765 | 4.4368 | 4.0792 | 4.5285 | 4.4526 |
Effectiveness ↑ | 4.3803 | 4.4345 | 4.0029 | 4.5005 | 4.4386 |
Average | 4.3731 | 4.4235 | 3.9498 | 4.5087 | 4.4492 |
Ablation experiment results, with the arrow (↑) indicating that a higher value corresponds to better performance.
Task | Category | ChatGLM3-6B-Tuned | ChatGLM3-6B-G-EA | IPM-AgriGPT |
---|---|---|---|---|
Objective Task ↑ | Professionalism | 0.4420 | 0.5340 | 0.5870 |
 | Safety | 0.4530 | 0.6040 | 0.6360 |
 | Effectiveness | 0.2920 | 0.3240 | 0.3920 |
 | Average | 0.3960 | 0.4870 | 0.5380 |
Subjective Task ↑ | Professionalism | 4.2105 | 4.3215 | 4.4566 |
 | Safety | 4.0354 | 4.2995 | 4.4526 |
 | Effectiveness | 4.2274 | 4.3134 | 4.4386 |
 | Average | 4.1577 | 4.3115 | 4.4492 |
Examples of responses to pest and disease issues from different baselines.
 | Chinese | English |
---|---|---|
Question | 木瓜锈病主要危害哪些部位,其病原体的形态特征有哪些? | Which parts of the papaya plant does papaya rust primarily damage, and what are the morphological characteristics of its pathogen? |
Standard Answer | 木瓜锈病主要危害叶片、叶柄、嫩枝及幼果。病原体是梨胶孢,属于担子菌亚门真菌。此病菌无夏孢子阶段,其冬孢子为双细胞并带有生长柄。性孢子器位于叶面表皮下,呈扁球形,性孢子为椭圆形或纺锤形。锈孢子器生长在叶背面,形成丛状毛状物,锈孢子呈近圆形、单细胞,颜色为黄色或浅褐色… | Papaya rust primarily damages the leaves, petioles, tender shoots, and young fruits. The pathogen is Gymnosporangium asiaticum Miyabe ex Yamada, a fungus of the subphylum Basidiomycotina. It has no urediniospore (summer spore) stage; its teliospores are two-celled and stalked. The spermogonia sit beneath the upper leaf epidermis as flattened spheres, and the pycniospores are elliptical or fusiform. The aecia grow on the underside of the leaf as tufted, hair-like clusters; the aeciospores are nearly round, single-celled, and yellow to light brown… |
Chatglm3-6B | 木瓜锈病主要危害木瓜的叶片、果实和枝条。其病原体的形态特征包括:菌丝呈粉红色,分支繁殖,分隔泡明显,孢子囊呈椭圆形,内含四个隔膜,孢子囊壁厚,孢子呈椭圆形… | Gymnosporangium asiaticum primarily affects the leaves, fruits, and branches of the papaya plant. The morphological characteristics of this pathogen are as follows: pink hyphae that reproduce through branching; well-defined septa; oval-shaped spore sacs containing four septa; thick walls of the spore sacs; and elliptical spores… |
Baichuan2-7B | 木瓜锈病主要危害木瓜属的叶片、嫩翘和果实。病原体的形态特征主要包括菌丝体、孢子囊和担孢子,菌丝体呈白色,无隔膜,具有吸收功能,孢子囊成串生长,呈球形或椭圆形,包含多个担孢子,担孢子成熟后从孢子囊逸出,遇到适宜条件可萌发成菌丝体,进一步浸染植物… | Gymnosporangium asiaticum primarily targets the leaves, tender branches, and fruits of the papaya genus. The morphological characteristics of this pathogen predominantly include mycelium, spore sacs, and basidiospores. The mycelium is characterized by its white color, non-septate structure, and absorptive functions. Spore sacs develop in clusters and are typically spherical or oval-shaped, containing multiple basidiospores. Upon maturation, the basidiospores are released from the spore sacs and can germinate into mycelium under favorable conditions, thereby facilitating further infection of the host plant… |
Chinese-llama-7B | 木瓜锈病是一种由木瓜锈菌引起的真菌病害,可危害木瓜的叶片和果实,导致锈斑和腐烂。其孢子体呈橙黄色,通过风雨传播,对木瓜产量和品质影响严重… | Gymnosporangium asiaticum is a fungal disease caused by the pathogen Puccinia caricae. It affects papaya leaves and fruits, causing rust spots and rot. The orange-yellow spores spread through wind and rain, severely impacting papaya yield and quality… |
ChatGLM3-6B-Tuned | 木瓜锈病主要危害木瓜的叶片、嫩枝、嫩叶及嫩枝等部位。病原体为真菌性锈菌,简称锈菌。锈菌为黑褐色,分割隔膜,分割隔膜有明显突起,呈网状,其上覆有厚而粗的粉粒。锈菌的锈点为黑褐色,形状为近圆形,每个锈点大小为2毫米,表面有细小的鳞片,每种锈菌锈点仅占就锈菌长度的10%~20%。锈菌的锈点为黑褐色,形状为近圆形,每个锈点大小为2毫米,表面有细小的鳞片,每种锈菌锈点仅占就锈菌长度的10%~20%。锈菌的锈点为黑褐色,形状为近圆形,每个锈点大小… | Gymnosporangium asiaticum primarily targets the leaves, tender branches, young foliage, and tender branches of the papaya plant. This pathogen is classified as a Urediniomycetes, commonly known as Puccinia caricae. The appearance is characterized by a dark brown coloration with segmented septa that are prominently raised, forming a net-like structure covered in thick, coarse powder. The resulting rust spots exhibit a black-brown hue and are nearly round in shape; each spot measures approximately 2 mm in diameter and features tiny scales on its surface. Typically, each type of rust fungus occupies only 10% to 20% of its length with these distinct rust spots. The appearance is characterized by a dark brown coloration with segmented septa that are prominently raised, forming a net-like structure covered in thick, coarse powder. The resulting rust spots exhibit a black-brown hue and are nearly round in shape; each spot measures approximately 2 mm in diameter and features tiny scales on its surface. Typically, each type of rust fungus occupies only 10% to 20% of its length with these distinct rust spots. The appearance is characterized by a dark brown coloration with segmented septa that are prominently raised, forming a net-like structure covered in thick, coarse powder. The resulting rust spots exhibit a black-brown hue and are nearly round in shape; each spot measures approximately… |
ChatGLM3-6B-G-EA | 木瓜锈病主要危害的叶片、叶柄、嫩枝及果实。其病原体的形态特征是叶柄和嫩枝上的锈斑为圆形或椭圆形,颜色为褐色,病斑后期颜色变深,表面覆盖黑色粉末状锈孢子堆… | Gymnosporangium asiaticum primarily targets the leaves, petioles, tender branches, and fruits of host plants. The morphological characteristics of this pathogen are marked by the presence of rust spots on the petioles and tender branches, which exhibit a circular or oval shape and a brown coloration. In the later stages of infection, the lesions darken in color, and their surfaces become covered with black powdery masses of rust spores… |
IPM-AgriGPT | 木瓜锈病的病原体是梨胶孢,叶子、叶柄、嫩枝和未成熟的果实由于其柔软的性质而特别脆弱,使它们更容易受到感染。因此,这种疾病主要针对木瓜植物的茎、枝、叶和果实。这种病原体形态为双细胞并带有生长柄,锈孢子近乎圆形或椭圆形,颜色为黄色或浅棕色… | Gymnosporangium asiaticum is caused by the pathogen Gymnosporangium asiaticum Miyabe ex Yamada. The leaves, petioles, young branches, and immature fruits being particularly vulnerable due to their tender nature, making them more susceptible to infection. Therefore, this disease primarily targets the stems, branches, leaves, and fruits of the papaya plant. This pathogen is characterized by a two-celled structure with growth stalks. The aecidiospore of this pathogen are nearly round or oval-shaped, yellow or light brown in color… |
References
1. Wang, H.; Wu, H.; Zhu, H.; Miao, Y.; Wang, Q.; Qiao, S.; Zhao, H.; Chen, C.; Zhang, J. A residual LSTM and Seq2Seq neural network based on GPT for Chinese rice-related question and answer system. Agriculture; 2022; 12, 813. [DOI: https://dx.doi.org/10.3390/agriculture12060813]
2. Zhao, X.; Chen, B.; Ji, M.; Wang, X.; Yan, Y.; Zhang, J.; Liu, S.; Ye, M.; Lv, C. Implementation of Large Language Models and Agricultural Knowledge Graphs for Efficient Plant Disease Detection. Agriculture; 2024; 14, 1359. [DOI: https://dx.doi.org/10.3390/agriculture14081359]
3. Gao, R.; Dong, Z.; Wang, Y.; Cui, Z.; Ye, M.; Dong, B.; Lu, Y.; Wang, X.; Song, Y.; Yan, S. Intelligent cotton Pest and disease detection: Edge computing solutions with transformer technology and knowledge graphs. Agriculture; 2024; 14, 247. [DOI: https://dx.doi.org/10.3390/agriculture14020247]
4. Xie, B.; Su, Q.; Tang, B.; Li, Y.; Yang, Z.; Wang, J.; Wang, C.; Lin, J.; Li, L. Combining Neural Architecture Search with Knowledge Graphs in Transformer: Advancing Chili Disease Detection. Agriculture; 2023; 13, 2025. [DOI: https://dx.doi.org/10.3390/agriculture13102025]
5. Tang, W.; Wen, X.; Hu, Z. Named Entity Recognition for Crop Diseases and Pests Based on Gated Fusion Unit and Manhattan Attention. Agriculture; 2024; 14, 1565. [DOI: https://dx.doi.org/10.3390/agriculture14091565]
6. Pan, S.; Luo, L.; Wang, Y.; Chen, C.; Wang, J.; Wu, X. Unifying Large Language Models and Knowledge Graphs: A Roadmap. arXiv; 2023; arXiv: 2306.08302. [DOI: https://dx.doi.org/10.1109/TKDE.2024.3352100]
7. Brown, T.B. Language models are few-shot learners. arXiv; 2020; arXiv: 2005.14165
8. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F. Llama: Open and efficient foundation language models. arXiv; 2023; arXiv: 2302.13971
9. Le Scao, T.; Fan, A.; Akiki, C.; Pavlick, E.; Ilić, S.; Hesslow, D.; Castagné, R.; Luccioni, A.S.; Yvon, F.; Gallé, M. Bloom: A 176b-parameter open-access multilingual language model. arXiv; 2023; arXiv: 2211.05100
10. Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Metzler, D. Emergent abilities of large language models. arXiv; 2022; arXiv: 2206.07682
11. Hendy, A.; Abdelrehim, M.; Sharaf, A.; Raunak, V.; Gabr, M.; Matsushita, H.; Kim, Y.J.; Afify, M.; Awadalla, H.H. How good are gpt models at machine translation? a comprehensive evaluation. arXiv; 2023; arXiv: 2302.09210
12. Park, C.; Koo, S.; Kim, G.; Lim, H. Towards Harnessing the Most of ChatGPT for Korean Grammatical Error Correction. Appl. Sci.; 2024; 14, 3195. [DOI: https://dx.doi.org/10.3390/app14083195]
13. Wang, Y.; Zhang, Z.; Wang, R. Element-aware summarization with large language models: Expert-aligned evaluation and chain-of-thought method. arXiv; 2023; arXiv: 2305.13412
14. Snæbjarnarson, V.; Símonarson, H.B.; Ragnarsson, P.O.; Ingólfsdóttir, S.L.; Jónsson, H.P.; Þorsteinsson, V.; Einarsson, H. A Warm Start and a Clean Crawled Corpus—A Recipe for Good Language Models. arXiv; 2022; arXiv: 2201.05601
15. BT, B.; Chen, J.-M. Performance Assessment of ChatGPT versus Bard in Detecting Alzheimer’s Dementia. Diagnostics; 2024; 14, 817. [DOI: https://dx.doi.org/10.3390/diagnostics14080817] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38667463]
16. Guo, T.; Nan, B.; Liang, Z.; Guo, Z.; Chawla, N.; Wiest, O.; Zhang, X. What can large language models do in chemistry? a comprehensive benchmark on eight tasks. Adv. Neural Inf. Process. Syst.; 2023; 36, pp. 59662-59688.
17. Plevris, V.; Papazafeiropoulos, G.; Jiménez Rios, A. Chatbots put to the test in math and logic problems: A comparison and assessment of ChatGPT-3.5, ChatGPT-4, and Google bard. AI; 2023; 4, pp. 949-969. [DOI: https://dx.doi.org/10.3390/ai4040048]
18. Shen, J.; Tenenholtz, N.; Hall, J.B.; Alvarez-Melis, D.; Fusi, N. Tag-LLM: Repurposing General-Purpose LLMs for Specialized Domains. arXiv; 2024; arXiv: 2402.05140
19. Shutske, J.M. Harnessing the Power of Large Language Models in Agricultural Safety & Health. J. Agric. Saf. Health; 2023; 29, pp. 205-224.
20. Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-efficient transfer learning for NLP. Proceedings of the International Conference on Machine Learning; Long Beach, CA, USA, 9–15 June 2019; pp. 2790-2799.
21. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. arXiv; 2021; arXiv: 2106.09685
22. Zhang, Q.; Chen, M.; Bukharin, A.; Karampatziakis, N.; He, P.; Cheng, Y.; Chen, W.; Zhao, T. AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. arXiv; 2023; arXiv: 2303.10512
23. Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. QLoRA: Efficient finetuning of quantized LLMs. arXiv; 2023; arXiv: 2305.14314
24. Zhang, Y.; Wang, Z.; He, Z.; Li, J.; Mai, G.; Lin, J.; Wei, C.; Yu, W. BB-GeoGPT: A framework for learning a large language model for geographic information science. Inf. Process. Manag.; 2024; 61, 103808. [DOI: https://dx.doi.org/10.1016/j.ipm.2024.103808]
25. Wang, H.; Liu, C.; Xi, N.; Qiang, Z.; Zhao, S.; Qin, B.; Liu, T. Huatuo: Tuning llama model with chinese medical knowledge. arXiv; 2023; arXiv: 2304.06975
26. Yang, H.; Li, S.; Gonçalves, T. Enhancing Biomedical Question Answering with Large Language Models. Information; 2024; 15, 494. [DOI: https://dx.doi.org/10.3390/info15080494]
27. Huang, Q.; Tao, M.; Zhang, C.; An, Z.; Jiang, C.; Chen, Z.; Wu, Z.; Feng, Y. Lawyer llama technical report. arXiv; 2023; arXiv: 2305.15062
28. Alghamdi, H.M.; Mostafa, A. Towards Reliable Healthcare LLM Agents: A Case Study for Pilgrims during Hajj. Information; 2024; 15, 371. [DOI: https://dx.doi.org/10.3390/info15070371]
29. Silva, B.; Nunes, L.; Estevão, R.; Aski, V.; Chandra, R. GPT-4 as an agronomist assistant? Answering agriculture exams using large language models. arXiv; 2023; arXiv: 2310.06225
30. Li, J.; Xu, M.; Xiang, L.; Chen, D.; Zhuang, W.; Yin, X.; Li, Z. Foundation models in smart agriculture: Basics, opportunities, and challenges. Comput. Electron. Agric.; 2024; 222, 109032. [DOI: https://dx.doi.org/10.1016/j.compag.2024.109032]
31. De Clercq, D.; Nehring, E.; Mayne, H.; Mahdi, A. Large language models can help boost food production, but be mindful of their risks. arXiv; 2024; arXiv: 2403.15475. [DOI: https://dx.doi.org/10.3389/frai.2024.1326153]
32. Zhao, B.; Jin, W.; Del Ser, J.; Yang, G. ChatAgri: Exploring potentials of ChatGPT on cross-linguistic agricultural text classification. Neurocomputing; 2023; 557, 126708. [DOI: https://dx.doi.org/10.1016/j.neucom.2023.126708]
33. Yang, S.; Liu, Z.; Mayer, W. ShizishanGPT: An Agricultural Large Language Model Integrating Tools and Resources. arXiv; 2024; arXiv: 2409.13537
34. Zhang, X.; Yang, Q. Self-qa: Unsupervised knowledge guided language model alignment. arXiv; 2023; arXiv: 2305.11952
35. Chung, J.J.Y.; Kamar, E.; Amershi, S. Increasing diversity while maintaining accuracy: Text data generation with large language models and human interventions. arXiv; 2023; arXiv: 2306.04140
36. Honovich, O.; Scialom, T.; Levy, O.; Schick, T. Unnatural instructions: Tuning language models with (almost) no human labor. arXiv; 2022; arXiv: 2212.09689
37. Li, S.; Yang, C.; Yin, Y.; Zhu, X.; Cheng, Z.; Shang, L.; Jiang, X.; Liu, Q.; Yang, Y. Autoconv: Automatically generating information-seeking conversations with large language models. arXiv; 2023; arXiv: 2308.06507
38. Du, Z.; Qian, Y.; Liu, X.; Ding, M.; Qiu, J.; Yang, Z.; Tang, J. Glm: General language model pretraining with autoregressive blank infilling. arXiv; 2021; arXiv: 2103.10360
39. Hsieh, C.-Y.; Li, C.-L.; Yeh, C.-K.; Nakhost, H.; Fujii, Y.; Ratner, A.; Krishna, R.; Lee, C.-Y.; Pfister, T. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. arXiv; 2023; arXiv: 2305.02301
40. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst.; 2022; 35, pp. 24824-24837.
41. Gu, Y.; Dong, L.; Wei, F.; Huang, M. Knowledge distillation of large language models. arXiv; 2023; arXiv: 2306.08543
42. Agarwal, R.; Vieillard, N.; Stanczyk, P.; Ramos, S.; Geist, M.; Bachem, O. Gkd: Generalized knowledge distillation for auto-regressive sequence models. arXiv; 2023; arXiv: 2306.13649
43. Huang, Y.; Chen, Y.; Yu, Z.; McKeown, K. In-context learning distillation: Transferring few-shot learning ability of pre-trained language models. arXiv; 2022; arXiv: 2212.10670
44. Jiang, Y.; Chan, C.; Chen, M.; Wang, W. Lion: Adversarial distillation of proprietary large language models. arXiv; 2023; arXiv: 2305.12870
45. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv; 2023; arXiv: 1706.03762
46. Lester, B.; Al-Rfou, R.; Constant, N. The power of scale for parameter-efficient prompt tuning. arXiv; 2021; arXiv: 2104.08691
47. GLM, T.; Zeng, A.; Xu, B.; Wang, B.; Zhang, C.; Yin, D.; Rojas, D.; Feng, G.; Zhao, H.; Lai, H. ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools. arXiv; 2024; arXiv: 2406.12793
48. Cui, Y.; Yang, Z.; Yao, X. Efficient and effective text encoding for chinese llama and alpaca. arXiv; 2023; arXiv: 2304.08177
49. Li, H.; Zhang, Y.; Koto, F.; Yang, Y.; Zhao, H.; Gong, Y.; Duan, N.; Baldwin, T. Cmmlu: Measuring massive multitask language understanding in chinese. arXiv; 2023; arXiv: 2306.09212
50. Yang, S.; Yuan, Z.; Li, S.; Peng, R.; Liu, K.; Yang, P. GPT-4 as Evaluator: Evaluating Large Language Models on Pest Management in Agriculture. arXiv; 2024; arXiv: 2403.11858
51. Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; Zhu, C. G-eval: Nlg evaluation using gpt-4 with better human alignment. arXiv; 2023; arXiv: 2303.16634
52. Chen, X.; Li, L.; Chang, L.; Huang, Y.; Zhao, Y.; Zhang, Y.; Li, D. Challenges and Contributing Factors in the Utilization of Large Language Models (LLMs). arXiv; 2023; arXiv: 2310.13343
53. Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv; 2023; arXiv: 2311.05232. [DOI: https://dx.doi.org/10.1145/3703155]
54. Feng, M.-s.; Jian, Z. On leaf nutrition DRIS diagnosis of Eucalyptus grandis. Sichuan Nongye Daxue Xuebao; 2003; 21, pp. 303-307.
55. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-T.; Rocktäschel, T. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv. Neural Inf. Process. Syst.; 2020; 33, pp. 9459-9474.
56. Verma, G.; Choi, M.; Sharma, K.; Watson-Daniels, J.; Oh, S.; Kumar, S. Mysterious Projections: Multimodal LLMs Gain Domain-Specific Visual Capabilities Without Richer Cross-Modal Projections. arXiv; 2024; arXiv: 2402.16832
57. Wei, X.; Wei, H.; Lin, H.; Li, T.; Zhang, P.; Ren, X.; Li, M.; Wan, Y.; Cao, Z.; Xie, B. Polylm: An open source polyglot large language model. arXiv; 2023; arXiv: 2307.06018
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
Traditional pest and disease management methods are inefficient: they rely on agricultural experts or static resources, making it difficult to respond quickly to large-scale outbreaks and to meet local needs. Although deep learning technologies have been applied to pest and disease management, challenges remain, such as the dependence on large amounts of manually labeled data and the limitations of dynamic reasoning. To address these challenges, this study proposes IPM-AgriGPT (Integrated Pest Management—Agricultural Generative Pre-Trained Transformer), a Chinese large language model specifically designed for pest and disease knowledge. The proposed Generation-Evaluation Adversarial (G-EA) framework is used to generate a high-quality question–answer corpus, which, combined with Agricultural Contextual Reasoning Chain-of-Thought Distillation (ACR-CoTD) and low-rank adaptation (LoRA), further optimizes the base model to build IPM-AgriGPT. For the evaluation phase, this study designed a specialized benchmark for the agricultural pest and disease domain to comprehensively assess the performance of IPM-AgriGPT on pest management tasks. Experimental results show that IPM-AgriGPT achieved excellent evaluation scores across multiple tasks, demonstrating its great potential for agricultural intelligence and pest management.
Author Affiliations
1 School of Software, Shanxi Agricultural University, Jinzhong 030801, China;
2 Agricultural Information Institute of Chinese Academy of Agricultural Sciences, Beijing 100081, China;
3 Guizhou Agricultural Science and Technology Information Institute, Guiyang 550006, China;
4 School of Software, Shanxi Agricultural University, Jinzhong 030801, China;