Societal Impact Statement
Investigating the trait preferences of farmers, consumers, and other stakeholders is vital for the adoption and impact of improved crop varieties. While qualitative research methods are known to increase the depth and scope of information obtained from respondents, only 5% of previous trait preference studies used qualitative data in their analyses. We show that AI-based natural language processing, particularly GPT models, is a time- and cost-effective mechanism for accurately analyzing open-ended trait preference data. This will contribute to the selection and prioritization of breeding targets that better meet end-user needs, with implications for food security and health outcomes globally.
INTRODUCTION
Crop breeding programs are fundamentally important for improving the productivity and quality of crops globally, playing a critical role in enhancing food security in a rapidly changing climate. Their success is dependent on setting breeding goals important to particular groups of stakeholders, including farmers, processors, and consumers, for a given crop in a given set of environmental conditions. Programs must consider multiple levels of goal setting, from high-level decisions on impact aims in various market segments (Donovan et al., 2022) to decisions on which traits to select for.
These decisions require identification of stakeholders and their preferences, as a mismatch between traits under selection in breeding programs and the needs of end-users has been linked to lower rates of crop variety adoption (Asrat et al., 2010; Wale & Yalew, 2007). In the public sector, especially for crop breeding programs focused on food security, trait priorities cannot be informed by marketing data and must therefore be informed by targeted studies to understand priority traits for end-users and consumers (Ragot et al., 2018). Trait preference studies facilitate the process of identifying traits, investigating their relative importance for end-users, and quantifying the target for each trait for use in breeding programs (Ragot et al., 2018), information that is often collected through surveys, interviews, choice experiments, and other social science research methods (Occelli et al., 2024).
Despite the importance of trait preference studies for the success of breeding programs, the methods utilized are not standardized or consistently applied. In social science studies, mixed methods (quantitative and qualitative methods used together) are the gold standard for capturing rich information from respondents to bridge the "how" and "why" questions (Caruth, 2013). This includes using results from one method to confirm, expand upon, or investigate the results of another. Researchers also implement these methods sequentially, such that quantitative methods are followed by qualitative ones to further explain the results of the former (Ivankova et al., 2006). Alternatively, qualitative methods can inform quantitative ones, providing confirmation of qualitative survey responses, reducing bias in quantitative survey design, and potentially lowering costs (Maltseva, 2016).
Mixed methods are not applied evenly in trait prioritization studies. A global analysis of trait preference studies shows that a majority still rely predominantly on quantitative, close-ended, survey-based research tools to capture respondent trait preferences (Occelli et al., 2024). The same study found that a third of research methods that pose direct questions to respondents are qualitative focus group discussions (FGDs), indicating that researchers are collecting responses to open-ended questions. However, only 5% of studies actually use these responses, signaling a data analysis bottleneck (Occelli et al., 2024). Furthermore, Occelli et al. (2024) called for crop ontology definitions more closely linked to crop trait preference studies to enable higher cross-comparability. A concerted global effort, initiated by the CGIAR in 2008, constructed a list of defined breeder traits and variables that is continuously revised and updated by a community of practice as new trait terms are added and refined (Arnaud et al., 2020). Recent advances in artificial intelligence (AI) related to natural language processing (NLP) could facilitate the analysis of open-ended questions from trait preference studies and their integration into crop ontology terms. These techniques have shown potential to dramatically reduce the time required for analysis, which may allow programs to use qualitative data more effectively in crop trait preference studies.
The advent of large language models (LLMs) has yielded substantial advances in NLP. LLMs, which are pretrained on large amounts of unlabeled text data (Radford et al., 2018), have been shown to achieve competitive performance on a variety of benchmark tasks given few labeled examples and without the need for fine-tuning (Brown et al., 2020). Generative pretrained transformers (GPTs) are a specific type of LLM based on the transformer architecture (Radford et al., 2018). The series of GPT models from OpenAI are state of the art, affordable, and highly accessible, with suites of tools available for a variety of common NLP tasks. For the specific task of data labeling, GPT-3, with 175 billion parameters, has been shown to be an effective annotator given few to zero labeled examples, lowering both the cost and time required relative to human labelers (Ding et al., 2023; Wang et al., 2021). Across application domains, findings are mixed on whether ChatGPT models can outperform fine-tuned models or match human labeling accuracy (Kuzman et al., 2023; Zhao et al., 2023; Zhu et al., 2023; Ziems et al., 2024).
LLMs have also been tested on the agricultural domain specifically. When answering questions from agriculture exams, GPT-4 achieved passing scores and, in one case, outperformed humans (Silva et al., 2023). LLMs have proven effective for answering farmer queries, giving pest management advice, and generating agriculture-related questions and answers (Balaguer et al., 2024; Didwania et al., 2024; Yang et al., 2024). Tzachor et al. (2023) discussed the implementation of GPT-4 for agricultural extension services, highlighting its shortcomings in a case study on advising cassava farmers in Nigeria. Zhao et al. (2023) applied ChatGPT models to agricultural text classification, showing that they can outperform fine-tuned models under certain conditions.
Here, we investigate the potential of NLP, including OpenAI's GPT models and the open-source models Dolly and Mistral, for analyzing responses to open-ended questions in crop trait preference studies. Data were collected as part of The Cassava Monitoring Survey in Nigeria in 2015 (Abdoulaye et al., 2018). In this dataset, both open- and close-ended trait preference questions were asked of the same respondents. This makes it a unique and highly valuable dataset, allowing us not only to test NLP on open-ended responses but also to compare open- and close-ended question types to gauge the information gained and lost with each data collection method.
The specific research objectives are:
- to develop an information extraction process to analyze open-ended text for trait prioritization studies,
- to compare the concordance between trait preference terms extracted from responses to open-ended questions and predefined terms from close-ended questions, and
- to compare the coverage of terms from open- and close-ended question types against trait ontology terms.
We found that the OpenAI GPT models were able to accurately perform multilabel text classification of trait preferences without fine-tuning and with minimal to no labeled examples. Extracting labels from the open-ended responses identified a greater diversity of traits, as well as information on their social functions. In addition, we found that similar proportions of the labels identified in open-ended questions were represented in the Cassava Crop Ontology (Agbona et al., 2023) as compared to close-ended question labels.
MATERIALS AND METHODS
Data collection and question types
The Cassava Monitoring Survey
The data used in this study were collected as part of The Cassava Monitoring Survey in Nigeria in 2015 (Abdoulaye et al., 2018). This was a nationally representative survey covering 16 states across four geopolitical regions, which together accounted for over 80% of total cassava production. In each region, 625 households were interviewed, for a total of 2500 households (Wossen et al., 2017).
Open-ended questions
Respondents were asked to give their trait preferences for cassava in three open-ended questions, covering their first, second, and third most liked traits in general, not in reference to particular varieties. Responses are single sentences recorded in open-text format, for example, "I like cassava root that makes good quality gari and akpu for me to feed my family." Each response may contain more than one trait, as in this example. Across the three open-ended questions and all households surveyed, excluding missing responses, there are a total of 6971 responses.
Close-ended questions
Respondents were also asked, in several close-ended questions, which traits they most liked about cassava varieties they had previously grown. Each respondent was asked to give their first, second, and third most liked "production" trait for each variety from a given list of 13 options or "Other." The same three questions were asked for "processing" and "consumption" traits, from lists of 13 and 12 traits, respectively, plus "Other."
Extracting labels from open-ended question responses and performing classification using GPT models
Workflow
We developed a workflow for processing the open-ended responses (Figure 1). This involved using NLP both to determine the set of labels which appeared in the open-ended question responses and to classify each response with the appropriate label(s). The titles of each subsection below correspond to a step in the workflow presented in Figure 1.
[FIGURE 1 OMITTED: workflow for label extraction and classification. SEE PDF.]
Data preprocessing for label extraction
Preprocessing of the open-ended question responses included normalizing the dataset by removing filler words and punctuation, lowercasing all text, and lemmatizing each word. This was completed using version 3.5.3 of the spaCy package in Python (Honnibal et al., 2020). We further removed all words that appeared only once, corrected common spelling errors, and replaced synonyms of frequently used words. This step yielded a cleaned list of lowercase, lemmatized strings for each response, which were then used to extract labels.
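As an illustrative sketch, the per-response cleaning might look as follows in spaCy, assuming the en_core_web_sm pipeline is installed; the spelling-fix and synonym maps shown are hypothetical stand-ins for the lists curated for the survey data, and the corpus-level removal of words appearing only once would follow as a separate pass.

```python
import spacy

# Minimal sketch of per-response cleaning. Assumes the small English
# pipeline is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Hypothetical stand-ins for the curated spelling and synonym maps.
SPELLING_FIXES = {"garri": "gari", "casava": "cassava"}
SYNONYMS = {"tuber": "root"}

def preprocess(response: str) -> list[str]:
    """Lowercase, lemmatize, and drop punctuation and filler (stop) words."""
    doc = nlp(response.lower())
    tokens = []
    for token in doc:
        if token.is_punct or token.is_stop or token.is_space:
            continue
        lemma = SPELLING_FIXES.get(token.lemma_, token.lemma_)
        tokens.append(SYNONYMS.get(lemma, lemma))
    return tokens

cleaned = preprocess("I like cassava root that makes good quality gari.")
print(cleaned)  # lemmatized content words, e.g. ['cassava', 'root', ..., 'gari']
```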
Extracting open-ended labels
We determined the sets of bigrams and trigrams present in the dataset and ranked them by likelihood ratio (Manning & Schütze, 1999), using functions provided in the NLTK package, version 3.7, in Python (Bird et al., 2009). The likelihood ratio is calculated as the log of the ratio between the likelihood that the following word is independent of the previous word and the likelihood that they are dependent (Manning & Schütze, 1999). Similarly, we used these functions to determine the common collocations of specific words that occurred frequently in the responses. Meaningful bigrams and trigrams were concatenated, while frequent words that no longer had meaningful associations were removed. This process was repeated iteratively until the bigrams and trigrams ceased to yield informative or new associations. This allowed us to determine the meaningful two- and three-word collocations most likely to be present in the open-ended responses, providing a list of potential labels. While this is less efficient than other keyphrase extraction and ranking methods, iteratively searching until only spurious associations remained allowed for greater control over the accurate extraction of terms. As LLMs had not been applied to the cassava trait preference domain, we wanted to ensure an exhaustive list of labels, preserving subtle differences in meaning. Further, the likelihood ratio test is independent of the relative frequencies of each n-gram and of their semantic content, and so it is not subject to relative or context-dependent importance metrics.
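One pass of this collocation search could be sketched with NLTK as below, assuming cleaned_responses holds the per-response token lists produced by the preprocessing step; in the iterative procedure described above, meaningful n-grams would be concatenated in the corpus before re-running.

```python
from nltk.collocations import (
    BigramAssocMeasures,
    BigramCollocationFinder,
    TrigramAssocMeasures,
    TrigramCollocationFinder,
)

# `cleaned_responses`: the per-response token lists from the preprocessing step.
bigram_measures = BigramAssocMeasures()
trigram_measures = TrigramAssocMeasures()

# from_documents avoids spurious n-grams spanning response boundaries.
bigram_finder = BigramCollocationFinder.from_documents(cleaned_responses)
bigram_finder.apply_freq_filter(2)  # ignore n-grams that occur only once
top_bigrams = bigram_finder.nbest(bigram_measures.likelihood_ratio, 25)

trigram_finder = TrigramCollocationFinder.from_documents(cleaned_responses)
trigram_finder.apply_freq_filter(2)
top_trigrams = trigram_finder.nbest(trigram_measures.likelihood_ratio, 25)

# Candidate labels, ranked by likelihood ratio; meaningful n-grams
# (e.g. ('high', 'yield')) become potential labels.
print(top_bigrams, top_trigrams)
```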
Multilabel text classification
Once the set of labels had been determined, the next step was to perform multilabel text classification on the open-ended responses. This was achieved using version 0.6.4 of the spacy-llm package, which allows users to access and implement LLMs for a variety of tasks within spaCy pipelines. This provides a user-friendly tool that precludes the need for prompt engineering and facilitates the use of a range of different models. More specifically, we used this tool to access the OpenAI API and employ the gpt-3.5-turbo-0125 and gpt-4o-mini-2024-07-18 models to perform the classification task. This version of GPT-3.5 was trained on data up through August 2021, and GPT-4o-mini on data up through September 2023. We further tested two free, open-source models, Dolly (Conover et al., 2023) and Mistral (Jiang et al., 2023), each with seven billion parameters. These were accessed via Hugging Face as model versions dolly-v2-7b and Mistral-7B-v0.1.
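As a minimal sketch of this setup, following the spacy-llm pattern of adding an "llm" component with task and model sections: the label set below is an abbreviated, illustrative subset of the extracted labels, the registry strings (spacy.TextCat.v3, spacy.GPT-3-5.v2) may vary across spacy-llm releases, and an OPENAI_API_KEY environment variable is assumed.

```python
import spacy

# Requires the spacy-llm package and an OPENAI_API_KEY environment variable.
nlp = spacy.blank("en")
nlp.add_pipe(
    "llm",
    config={
        "task": {
            "@llm_tasks": "spacy.TextCat.v3",
            # Abbreviated, illustrative subset of the 72 extracted labels.
            "labels": "HIGH_YIELD_ROOTS,TUBER_SIZE,QUALITY_GARI",
            "exclusive_classes": False,  # multilabel classification
        },
        "model": {
            "@llm_models": "spacy.GPT-3-5.v2",  # registry string varies by release
            "config": {"temperature": 0.0},
        },
    },
)

doc = nlp("I like cassava root that makes good quality gari and akpu.")
print(doc.cats)  # label -> score dictionary, e.g. {'QUALITY_GARI': 1.0, ...}
```

For one-shot learning, labeled examples can additionally be supplied to the task, for example through spacy-llm's few-shot example reader.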
The classification task was performed on the unprocessed, open-text trait preference responses. We used both zero-shot and one-shot learning, electing to test the applicability of these models to the agricultural domain without fine-tuning, which requires larger datasets, manually labeled training data, and greater computational resources. For zero-shot learning, the only information provided to each GPT model was the prompt and the set of labels. For one-shot learning, the model was also given labeled examples. To find examples for each label, we first manually tagged all responses in the dataset with the set of open-ended labels extracted in Section 2.2.3. We then sampled a single observation for each of the 72 labels to obtain one example per class. Because these tools would in practice be applied to unlabeled data, finding a single example from each class would not be directly achievable. To circumvent this issue, we additionally tested both k-means clustering and prelabeling with GPT-3.5 to select examples.
We used version 1.2.2 of the scikit-learn module (Pedregosa et al., 2011) in Python to implement k-means clustering (Arthur & Vassilvitskii, 2007; Forgy, 1965; Lloyd, 1982; MacQueen, 1967). This unsupervised method groups data by similarity, minimizing the sum of squared Euclidean distances between data points and cluster centroids. The data were first preprocessed as in Section 2.2.2, and term frequency-inverse document frequency (TF-IDF) vectorization was applied (Spärck Jones, 1972). k-Means clustering was then implemented on the TF-IDF features using Lloyd's algorithm (Forgy, 1965; Lloyd, 1982) and initializing centroids with k-means++ (Arthur & Vassilvitskii, 2007), where k was set to match the number of labels previously extracted from the open-ended responses in Section 2.2.3. We then randomly sampled a single response from each cluster and manually labeled it.
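A minimal scikit-learn sketch of this selection step, assuming texts (a hypothetical name) holds the preprocessed responses joined back into strings:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# `texts`: the preprocessed responses joined back into strings.
k = 72  # number of labels extracted from the open-ended responses

X = TfidfVectorizer().fit_transform(texts)
km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
clusters = km.fit_predict(X)

# Draw one response per cluster for manual labeling.
rng = np.random.default_rng(0)
examples = [texts[rng.choice(np.flatnonzero(clusters == c))] for c in range(k)]
```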
In addition to clustering, we tested using GPT-3.5 to identify examples. As GPT-3.5 has a significantly lower financial cost than the other OpenAI models, it was feasible to apply it to the entire dataset using zero-shot learning, thereby "prelabeling" the data. Classifying the 6971 responses cost about $10 USD and allowed us to randomly sample a single response tagged with each label. These were then manually relabeled as needed.
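The sampling step might be sketched as follows, assuming prelabels maps each response index to the label set assigned by zero-shot GPT-3.5, and responses and all_labels hold the raw responses and the 72 extracted labels (all hypothetical names):

```python
import random

# `prelabels`: response index -> set of labels assigned by zero-shot GPT-3.5;
# `responses`, `all_labels`: the raw responses and the 72 extracted labels.
random.seed(0)
examples = {}
for label in all_labels:
    tagged = [i for i, labels in prelabels.items() if label in labels]
    if tagged:  # rare labels may never be predicted by the prelabeling pass
        examples[label] = responses[random.choice(tagged)]
# Each sampled example is then manually checked and relabeled as needed.
```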
Performance evaluation
To evaluate the NLP workflow, we assessed both the coverage of the open-ended labels and the accuracy of the multilabel text classification. Coverage was measured as the proportion of responses tagged with at least one label.
To assess model performance, and to compare different labeling schemes, we compared the manually tagged data against classification by the GPT models after removing the examples selected for one-shot learning. For each trait, the precision (1), recall (2), and F1 score (3) were calculated as outlined below, where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$

The overall performance of each model was evaluated by calculating the micro-precision, micro-recall, and micro-F1 scores, that is, the precision, recall, and F1 scores computed by pooling counts across all traits.
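These metrics can be computed directly with scikit-learn; the sketch below assumes y_true and y_pred (hypothetical names) are lists of label sets per response, holding the manual annotations and a model's predictions, respectively.

```python
from sklearn.metrics import precision_recall_fscore_support
from sklearn.preprocessing import MultiLabelBinarizer

# `y_true`, `y_pred`: lists of label sets per response (manual annotations
# and model predictions); `all_labels`: the full label set.
mlb = MultiLabelBinarizer(classes=all_labels)
Y_true = mlb.fit_transform(y_true)
Y_pred = mlb.transform(y_pred)

# Per-trait precision/recall/F1 (Equations 1-3), then micro-averages that
# pool TP/FP/FN counts across all traits.
p, r, f1, _ = precision_recall_fscore_support(Y_true, Y_pred, average=None, zero_division=0)
micro_p, micro_r, micro_f1, _ = precision_recall_fscore_support(
    Y_true, Y_pred, average="micro", zero_division=0
)
print(f"micro-P={micro_p:.3f} micro-R={micro_r:.3f} micro-F1={micro_f1:.3f}")
```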
Each combination of LLM and zero- or one-shot learning scheme was repeated for five iterations to derive 95% confidence intervals for each evaluation metric.
Label and ontology comparisons
Finally, we compared the set of labels extracted from the open-ended responses to the predetermined close-ended labels, and both sets were further compared to the existing set of ontology terms. For each label in a given set, any corresponding term(s) in the comparison set were identified, and the label was classified as "represented," "inexact or partial representation," "not represented," or "nontrait." "Represented" labels had exact or synonymous matches, for example, "store well underground" and "in-ground storability." "Inexact or partial representation" refers to labels that had analogous or overlapping matches, for example, "quality gari" and "taste for gari." "Not represented" refers to labels that are cassava traits but had neither exact nor partial matches. "Nontrait" labels were separated from the "not represented" category, as the close-ended labels and crop ontology are explicitly defined as sets of crop traits.
RESULTS
GPT models can be used to classify responses to open-ended crop trait preference questions with minimal labeled examples and no fine-tuning
OpenAI's GPT models classified the responses to the open-ended questions by their corresponding label(s) with varying levels of success across modeling schemes (Figure 2). Zero-shot learning uses no labeled data; Manual One-Shot uses manually labeled data; Clustered One-Shot uses examples selected by k-means clustering and then manually labeled; and Relabeled One-Shot uses examples prelabeled by GPT-3.5 and then manually relabeled. With zero-shot learning, Dolly and Mistral achieved very low classification accuracy, while GPT-3.5 achieved an average micro-F1 score of 0.617. GPT-3.5's micro-precision and micro-recall were 0.578 and 0.665, respectively, indicating that the model tended to include false positives rather than omit correct labels. GPT-4o-mini outperformed GPT-3.5 and substantially outperformed Dolly and Mistral, with an average micro-F1 score of 0.664.
[FIGURE 2 OMITTED: classification performance across models and learning schemes. SEE PDF.]
One-shot learning outperformed zero-shot learning in all cases except Dolly, with slight differences among examples identified through manual annotation, k-means clustering, and relabeling of GPT-3.5 output (Figure 2). For one-shot learning with GPT-4o-mini, the micro-F1 score was 0.842 for clustered examples compared to 0.833 for relabeled examples. Given these slight differences, clustering appears to be a viable option for identifying and labeling example data without incurring extra cost. As with zero-shot learning, the OpenAI models using one-shot learning had micro-recall values significantly higher than their micro-precision, capturing true positives but erroneously including false positives. Conversely, Mistral had micro-precision values similar to GPT-3.5 and GPT-4o-mini but low micro-recall values, indicating higher numbers of false negatives (missed labels).
The F1 scores for individual labels increased most for those that appeared more frequently in the open-ended responses (Figures S1 and S2). For example, the F1 score for "tuber size" more than doubled from GPT-3.5 with zero-shot learning to GPT-4o-mini with one-shot learning. In addition, for one-shot learning with examples selected through clustering or through sampling of data prelabeled by GPT-3.5, certain infrequent labels were still not captured in the example data.
Table 1 shows the percentage of open-ended responses tagged with at least one label. Over 95% of the observations for the most liked trait were labeled by each GPT model, indicating that the set of labels extracted from the open-ended responses provided sufficient coverage of the information contained within the responses. GPT-3.5 achieved higher coverage of the responses than GPT-4o-mini but had a significantly higher rate of false positives, which likely accounts for this difference.
TABLE 1 Average percentage of open-ended responses tagged with at least one label by GPT-3.5 using zero-shot learning and by GPT-4o-mini using one-shot learning with examples identified through k-means clustering.
| | GPT-3.5 zero-shot | GPT-4o-mini clustered one-shot | Total observations |
| --- | --- | --- | --- |
| Trait 1 | 2455 (98.95%) | 2378 (95.33%) | 2480 |
| Trait 2 | 2368 (98.77%) | 2280 (94.12%) | 2397 |
| Trait 3 | 2040 (97.18%) | 1931 (90.51%) | 2094 |
Open-ended questions provide comparatively higher diversity of data on trait preferences
We compared the precoded close-ended labels with those extracted from the open-ended responses (Table S1). The close-ended questions had 38 distinct labels across the processing, production, and consumption categories, plus an "Other" option for each, while 72 labels in total were extracted from the open-ended responses. Six of the close-ended labels were not found in the open-ended question responses, while 28 open-ended labels had no corresponding close-ended label. Of those 28, which comprised nearly 40% of all open-ended labels (Figure 3), 16 were crop traits and 12 were additional characteristics, including common indicators of the social functions of traits, such as "commercial purpose" and "food security."
[FIGURE 3 OMITTED: overlap between open- and close-ended labels. SEE PDF.]
We also compared the relative frequencies of labels in the open-ended responses (Figure 4). Figure 4b shows the frequencies of traits that had no corresponding label in the close-ended questions. These traits had relative frequencies below 5%, as did the majority of labels, suggesting that open-ended responses may better inform which niche traits to include in close-ended survey design. It is also worth noting that, for the most liked trait ("Trait 1"), "earn income," a nontrait label (Figure 4c), is the fourth most frequent label in the open-ended responses, behind only "high yield roots," "tuber size," and "quality gari."
[FIGURE 4 OMITTED: relative frequencies of open-ended labels. SEE PDF.]
Trait ontology terms map as closely to open-ended question labels as to close-ended question labels
Labels from open- and close-ended questions have similar relative representations in the Cassava Crop Ontology (Figure 5). Among the 60 open-ended trait labels, nine do not map to an existing ontology trait. This again emphasizes the potential of using open-ended responses to discover traits. For example, one trait that appears in open-ended responses but is absent from the trait ontology is "harvest gradually," which could have an important social function but has thus far not been included in the ontology terms. Further, 40 of the open- and 23 of the close-ended labels map inexactly or partially. Tables S2 and S3 outline the correspondence between ontology terms and each open- and close-ended label.
[FIGURE 5 OMITTED: representation of open- and close-ended labels in the Cassava Crop Ontology. SEE PDF.]
DISCUSSION
We show that NLP, particularly the application of GPT models, is a viable and accurate mechanism for extracting labels and performing multilabel text classification on open-text data in crop trait preference studies. OpenAI's GPT-4o-mini model achieved high classification accuracy with no fine-tuning and minimal labeled examples, proving to be a time-efficient approach. This model is also less expensive than GPT-3.5, making it cost-efficient as well. While these models were only tested for cassava breeding in Nigeria, their ability to identify and accurately classify crop traits in context, without domain-specific fine-tuning, indicates their potential to generalize to other crops and contexts, as well as to other NLP tasks within the trait preference domain.
New tools for performing NLP tasks are being developed rapidly, providing opportunities to improve on the techniques implemented here. GPT models and topic modeling may also be used to extract labels, which would further increase the efficiency of that process. While the open-source models tested here did not match the performance of the OpenAI models, ongoing efforts to provide open-source, state-of-the-art LLMs may improve their efficacy. In these cases, classification accuracy may be increased through fine-tuning for the agriculture or crop trait domains, or through retrieval-augmented generation (RAG), which would provide the models with additional information, such as the definition of each label in the crop ontology. However, while the open-source models are free to use, it is generally recommended to run them on a GPU machine, which may not be accessible to all users.
By using these tools to analyze open-ended questions, we compared traits important to stakeholders with those included in the Cassava Crop Ontology, finding points of mismatch between the two, which is likely to limit the use of trait preference research for informing breeding targets for crop improvement. Extracting labels from open-ended responses may aid in establishing a larger set of traits, expanding the ontology terms and facilitating cohesion between trait preferences and breeding targets. Additionally, as the ontology grows, the label extraction process may not be necessary, as those terms may be used as a knowledge base for named entity recognition and linking.
We also saw an information gain using open-ended questions, relative to the close-ended questions, allowing us to obtain a set of labels that both largely encompassed the close-ended labels and included new information. This provides greater depth of knowledge and guiding information for developing close-ended surveys, particularly in the curation of niche traits. The efficiency of analysis also allows for larger scale trait prioritization studies that make use of open-text questions, increasing the scope of what can be collected and analyzed. To capitalize on this information to better align breeding programs according to stakeholder needs, this workflow may be further applied to additional datasets to test its applicability to other crops and types of qualitative data.
ACKNOWLEDGMENTS
We acknowledge the support and funding of the CGIAR Excellence in Breeding (EiB) Platform and the Bill and Melinda Gates Foundation's investment INV041105. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant DGE-2139899. We thank Jing Yi for her feedback and guidance on data analysis.
CONFLICT OF INTEREST STATEMENT
The authors declare no conflict of interest.
DATA AVAILABILITY STATEMENT
The Cassava Monitoring Survey data can be downloaded from . The Cassava Crop Ontology can be found at . The code and manual annotations can be found at .
REFERENCES
Abdoulaye, T., Assfaw, T., Manyong, V., & Rabbi, I. (2018). The cassava monitoring survey (CMS) in Nigeria household and plot level data. International Institute of Tropical Agriculture (IITA). https://doi.org/10.25502/20180627/0915/AT
Agbona, A., Menda, N., Egesi, C., Kawuki, R., Bakare, M., Laporte, M. A., Cooper, L., Arnaud, E., De Souza, K., Manner, R., Van Etten, J., Teeken, B., & Kulakow, P. (2023). Planteome/CO_334‐cassava‐traits: Updating release to generate DOI—For Planteome Release V5. Zenodo. https://doi.org/10.5281/ZENODO.8253593
Arnaud, E., Laporte, M.‐A., Kim, S., Aubert, C., Leonelli, S., Miro, B., Cooper, L., Jaiswal, P., Kruseman, G., Shrestha, R., Buttigieg, P. L., Mungall, C. J., Pietragalla, J., Agbona, A., Muliro, J., Detras, J., Hualla, V., Rathore, A., das, R. R., … King, B. (2020). The Ontologies Community of Practice: A CGIAR initiative for big data in Agrifood systems. Patterns, 1, 100105. https://doi.org/10.1016/j.patter.2020.100105
Arthur, D., & Vassilvitskii, S. (2007). K‐means++: The advantages of careful seeding. Proceedings of the Annual ACM‐SIAM Symposium on Discrete Algorithms, 8, 1027–1035. https://doi.org/10.1145/1283383.1283494
Asrat, S., Yesuf, M., Carlsson, F., & Wale, E. (2010). Farmers' preferences for crop variety traits: Lessons for on‐farm conservation and technology adoption. Ecological Economics, 69(12), 2394–2401. https://doi.org/10.1016/j.ecolecon.2010.07.006
Balaguer, A., Benara, V., de Freitas Cunha, R. L., de Moura Estevão Filho, R., Hendry, T., Holstein, D., Marsman, J., Mecklenburg, N., Malvar, S., Nunes, L. O., Padilha, R., Sharp, M., Silva, B., Sharma, S., Aski, V., & Chandra, R. (2024). Rag vs fine‐tuning: Pipelines, tradeoffs, and a case study on agriculture. arXiv. https://doi.org/10.48550/arXiv.2401.08406
Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python: Analyzing text with the natural language toolkit. O'Reilly Media, Inc.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., & Agarwal, S. (2020). Language models are few‐shot learners. In Advances in neural information processing systems (Vol. 33) (pp. 1877–1901). Curran Associates, Inc.
Caruth, G. (2013). Demystifying mixed methods research design: A review of the literature. Mevlana International Journal of Education, 3(2), 112–122. https://doi.org/10.13054/mije.13.35.3.2
Conover, M., Hayes, M., Mathur, A., Xie, J., Wan, J., Shah, S., Ghodsi, A., Wendell, P., Zaharia, M., & Xin, R. (2023). Free Dolly: Introducing the world's first truly open instruction‐tuned LLM. Retrieved May 9, 2025, from https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
Didwania, K., Seth, P., Kasliwal, A., & Agarwal, A. (2024). Agrillm: Harnessing transformers for farmer queries. arXiv. https://doi.org/10.48550/arXiv.2407.04721
Ding, B., Qin, C., Liu, L., Chia, Y. K., Li, B., Joty, S., & Bing, L. (2023). Is GPT‐3 a good data annotator? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 11173–11195). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-long.626
Donovan, J., Coaldrake, P., Rutsaert, P., Bänziger, M., Gitonga, A., Naziri, D., Demont, M., Newby, J., & Ndegwa, M. (2022). Market intelligence for informing crop‐breeding decisions by CGIAR and NARES. Market Intelligence Brief Series 1. CGIAR.
Forgy, E. W. (1965). Cluster analysis of multivariate data: Efficiency versus interpretability of classifications. Biometrics, 21, 768–769.
Honnibal, M., Montani, I., Van Landeghem, S., & Boyd, A. (2020). spaCy: Industrial‐strength natural language processing in Python. Zenodo. https://doi.org/10.5281/zenodo.1212303
Ivankova, N. V., Creswell, J. W., & Stick, S. L. (2006). Using mixed‐methods sequential explanatory design: From theory to practice. Field Methods, 18(1), 3–20. https://doi.org/10.1177/1525822X05282260
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.‐A., Stock, P., Le Scao, T., Lavril, T., Wang, T., Lacroix, T., & El Sayed, W. (2023). Mistral 7B. arXiv. https://doi.org/10.48550/arXiv.2310.06825
Kuzman, T., Mozetič, I., & Ljubešić, N. (2023). ChatGPT: Beginning of an end of manual linguistic data annotation? Use case of automatic genre identification. arXiv. https://doi.org/10.48550/arXiv.2303.03953
Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137. https://doi.org/10.1109/TIT.1982.1056489
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1) (pp. 281–298). University of California Press.
Maltseva, K. (2016). Using correspondence analysis of scales as part of mixed methods design to access cultural models in ethnographic fieldwork: Prosocial cooperation in Sweden. Journal of Mixed Methods Research, 10(1), 82–111. https://doi.org/10.1177/1558689814525262
Manning, C., & Schütze, H. (1999). Foundations of statistical natural language processing. MIT Press.
Occelli, M., Mukerjee, R., Miller, C., Porciello, J., Puerto, S., Garner, E., Guerra, M., Gomez, M. I., & Tufan, H. A. (2024). A scoping review on tools and methods for trait prioritization in crop breeding programmes. Nature Plants, 10(3), 402–411. https://doi.org/10.1038/s41477-024-01639-6
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., & Vanderplas, J. (2011). Scikit‐learn: Machine learning in Python. Journal of Machine Learning Research, 12(85), 2825–2830.
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre‐training. OpenAI.
Ragot, M., Bonierbale, M., & Weltzien, E. (2018). From Market Demand to Breeding Decisions: A Framework. GBI Working Paper. No. 2. CGIAR Gender and Breeding Initiative.
Silva, B., Nunes, L., Estevão, R., Aski, V., & Chandra, R. (2023). GPT‐4 as an agronomist assistant? Answering agriculture exams using large language models. arXiv. https://doi.org/10.48550/arXiv.2310.06225
Spärck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1), 11–21. https://doi.org/10.1108/eb026526
Tzachor, A., Devare, M., Richards, C., Pypers, P., Ghosh, A., Koo, J., Johal, S., & King, B. (2023). Large language models and agricultural extension services. Nature Food, 4(11), 941–948. https://doi.org/10.1038/s43016-023-00867-x
Wale, E., & Yalew, A. (2007). Farmers' variety attribute preferences: Implications for breeding priority setting and agricultural extension policy in Ethiopia. African Development Review, 19(2), 379–396. https://doi.org/10.1111/j.1467-8268.2007.00167.x
Wang, S., Liu, Y., Xu, Y., Zhu, C., & Zeng, M. (2021). Want to reduce labeling cost? GPT‐3 can help. arXiv. https://doi.org/10.48550/arXiv.2108.13487
Wossen, T., Girma, G., Abdoulaye, T., & Rabbi, I. (2017). The cassava monitoring survey in Nigeria. International Institute of Tropical Agriculture (IITA).
Yang, S., Yuan, Z., Li, S., Peng, R., Liu, K., & Yang, P. (2024). GPT‐4 as evaluator: Evaluating large language models on pest management in agriculture. arXiv. https://doi.org/10.48550/arXiv.2403.11858
Zhao, B., Jin, W., Del Ser, J., & Yang, G. (2023). ChatAgri: Exploring potentials of ChatGPT on cross‐linguistic agricultural text classification. Neurocomputing, 557, 126708. https://doi.org/10.1016/j.neucom.2023.126708
Zhu, Y., Zhang, P., Haq, E.‐U., Hui, P., & Tyson, G. (2023). Can ChatGPT reproduce human‐generated labels? A study of social computing tasks. arXiv. https://doi.org/10.48550/arXiv.2304.10145
Ziems, C., Held, W., Shaikh, O., Chen, J., Zhang, Z., & Yang, D. (2024). Can large language models transform computational social science? arXiv. https://doi.org/10.48550/arXiv.2305.03514