Abbreviations
- CCC
- Lin's concordance correlation coefficient
- CV
- cross-validation
- R2cv
- coefficient of multiple determination of cross-validation
- D1
- first derivative
- D2
- second derivative
- GD
- gap derivative
- GUI
- graphical user interface
- ML
- machine learning
- NIRS
- near-infrared spectroscopy
- PLSR
- partial least squares regression
- R2p
- squared Pearson's correlation of predicted and observed values in a test set
- R2sp
- squared Spearman's correlation of predicted and observed values in a test set
- RF
- random forest
- RMSE
- root mean squared error
- RPD
- residual predictive deviation
- RPIQ
- ratio of performance to interquartile distance
- SE
- standard error
- SG
- Savitzky-Golay filter
- SNV
- standard normal variate
- SVM
- support vector machine
- Vis
- visible light.
Visible and near-infrared spectroscopy (Vis-NIRS) allows for rapid and low-cost analysis of the physical and chemical makeup of biological samples in a non-destructive manner through the calibration of prediction models relating spectra to reference values. This is especially useful in the context of plant breeding programs, which often rely on expensive and time-consuming quality trait information for the development and release of improved plant varieties. In these cases, the prediction of quality traits with Vis-NIRS affords an opportunity to increase the efficiency of varietal development.
The use of NIRS to predict quality traits in agricultural products revolutionized the industry over 50 yr ago (McClure, 2003; dos Santos et al., 2013), but only with the recent application of microelectromechanical systems to spectroscopy has the cost of high-quality spectrometers reached a reasonable price point for most plant breeding and research programs (Crocombe, 2004; Pasquini, 2018; dos Santos et al., 2013). To evaluate and appropriately deploy these new tools in research and other high-value applications, prediction models must be developed for each phenotype of interest. Although many spectrometers come equipped with proprietary models that are pre-trained for phenotypes such as total protein content or brix, low-cost hardware often does not, leaving the development and maintenance of prediction models to the user. Model calibration as a service and standalone commercial software are viable solutions, but the high cost of both negates the benefits of an inexpensive spectrometer.
Open-source software packages for the analysis of Vis-NIR spectral data are available in several programming languages, including R (R Core Team, 2020) and Python (Van Rossum & Drake, 2009), but thorough analysis, regardless of the language, requires functions to be systematically assembled from many different packages or libraries. In R, a user interested in spectral analysis must pull together preprocessing functions from prospectr (Stevens & Ramirez-Lopez, 2014) and model training functions from packages such as pls (Mevik, Wehrens, & Liland, 2019), randomForest (Liaw & Wiener, 2002), kernlab (Karatzoglou, Smola, Hornik, & Zeileis, 2004), and caret (Kuhn, 2020). Without a working knowledge of these programming languages and training in Vis-NIR reflectance analysis, constructing a functional pipeline is challenging. A web-based graphical user interface (GUI) would lower the entry point to spectral analysis by making a complete storage and analysis pipeline available to anyone with internet access, effectively democratizing Vis-NIRS when paired with low-cost spectrometers.
We developed an R package, waves, that brings essential spectral analysis functions together to enable streamlined filtering, preprocessing, model training, and phenotype prediction with cross-validation schemes tailored to address prediction scenarios commonly faced by plant breeding programs (Jarquín et al., 2017). To further leverage the functionality of waves in a database environment without a need to be familiar with the R programming language, we also created a waves-based spectral data analysis tool in Breedbase, an open-source family of relational databases for the storage and analysis of plant breeding genomic and phenomic data (
FIGURE 1. waves R package functions are integrated into the Breedbase spectral analysis tool for spectral data storage, preprocessing, cross-validation set formation, model development, model storage, and trait prediction
This integration also includes the development of spectral data storage and handling capabilities within the database, linking spectral JSONB tables with related field trial meta- and phenotypic data (Figure 2).
FIGURE 2. The Breedbase spectral analysis tool can be accessed using a web-based graphical user interface, shown here in the Cassavabase Breedbase instance
The Breedbase digital ecosystem allows users to store all available data across germplasm sets and environments, which can then be accessed to customize calibration models. To our knowledge, this is the first open-source integration of spectral data storage and analysis with plant breeding database tools.
waves is a standalone R package that is called by the backend Perl, JavaScript, and R code base of Breedbase (
- waves is an open-source R package for spectral data analysis in plant breeding.
- waves includes breeding-relevant cross-validation schemes to evaluate the predictive accuracy of models.
- Breedbase—an open-source database—was extended to support spectral data storage.
- A new graphical user interface was developed for implementation of waves in Breedbase.
The storage of spectral data can be a major challenge for Vis-NIRS users, as high-dimensional datasets are common and must be manipulated in a controlled manner to correctly execute calibration procedures. Though proprietary software often offers both storage and analysis features, open-source resources are neither linked to phenotype data nor optimized for analytical purposes (Chalk, 2016;
Spectral calibration models can be heavily affected by the presence of outliers, whether they are the result of spectrometer artifacts or user error. Mahalanobis distance (Mahalanobis, 1936) is a measure of the distance between a single observation and a larger distribution and is commonly used for the objective identification of outliers in a multivariate space, including spectral datasets (De Maesschalck et al., 2000). The FilterSpectra() function in waves identifies outliers by calculating the Mahalanobis distance of each observation in a given spectral matrix using the mahalanobis() function from the stats package (R Core Team, 2020). In Breedbase, this procedure is applied on a per-dataset basis on upload, and each observation identified as an outlier is given the binary tag “outlier.”
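The outlier screen described above can be illustrated with a minimal sketch. It is written in plain Python for readability and is not the waves implementation, which relies on mahalanobis() in R. The two-wavelength toy data, the helper names, and the cutoff are assumptions made for this example: the cutoff is the 0.95 quantile of a chi-square distribution with two degrees of freedom, which has the closed form -2 ln(1 - 0.95) only because the toy data have exactly two columns. The text does not state which cutoff waves or Breedbase apply.

```python
import math

def column_means(rows):
    """Mean of each wavelength column."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance(rows, means):
    """Sample covariance matrix (n - 1 denominator)."""
    n, p = len(rows), len(means)
    cov = [[0.0] * p for _ in range(p)]
    for row in rows:
        dev = [x - m for x, m in zip(row, means)]
        for i in range(p):
            for j in range(p):
                cov[i][j] += dev[i] * dev[j] / (n - 1)
    return cov

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def squared_mahalanobis(rows):
    """Squared Mahalanobis distance of every row from the column means."""
    means = column_means(rows)
    cov = covariance(rows, means)
    distances = []
    for row in rows:
        dev = [x - m for x, m in zip(row, means)]
        weighted = solve(cov, dev)  # cov^-1 @ dev without forming the inverse
        distances.append(sum(d * w for d, w in zip(dev, weighted)))
    return distances

# Hypothetical reflectance at two wavelengths for 11 scans. The last scan is
# within the range of each column but lies off the trend of the other scans.
spectra = [
    [0.40, 0.23], [0.45, 0.27], [0.50, 0.31], [0.55, 0.35], [0.60, 0.39],
    [0.40, 0.21], [0.45, 0.25], [0.50, 0.29], [0.55, 0.33], [0.60, 0.37],
    [0.50, 0.38],
]

# Assumed cutoff: 0.95 chi-square quantile, valid in this form for 2 columns.
cutoff = -2.0 * math.log(1.0 - 0.95)
distances = squared_mahalanobis(spectra)
flagged = [i for i, d in enumerate(distances) if d > cutoff]
print(flagged)  # -> [10]
```

The flagged scan would not stand out if each wavelength were inspected separately, which is why a multivariate distance is used for this step. Real spectra have far more columns than this toy example, and the covariance matrix may then need regularization or dimension reduction before it can be inverted.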
After outlier identification, data can be visualized using the PlotSpectra() function in waves. This function uses the filtered spectra and the ggplot() function from the ggplot2 package (Wickham, 2016) to create a line plot with outliers highlighted by line color. A list of rows identified as outliers is shown beneath the plot. In Breedbase, these plots are saved as .png files and linked to the original input datasets. Plot image files can be downloaded via the “Download Plot” button on the upload webpage.
To obtain a stable and reliable spectral profile, most spectrometer manufacturers recommend that multiple spectral scans be captured for each sample. Although some spectrometers aggregate these scans internally, many do not, requiring the user to do so before analysis can take place. Breedbase handles these cases upon data upload, after the filtering steps, by calling the AggregateSpectra() function from waves and saving the aggregated scans for future access through the search wizard feature. After aggregation, the user exits the upload workflow, and the raw data file is saved and accessible through the Breedbase file storage system.
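Scan aggregation amounts to collapsing replicate scans into one spectrum per sample. The sketch below shows the idea in plain Python; it is not the AggregateSpectra() interface. The function name, the (sample_id, spectrum) input layout, and the choice between mean and median are assumptions for illustration, since the text does not say which summary statistic is applied.

```python
from collections import defaultdict
from statistics import fmean, median

def aggregate_scans(scans, stat="mean"):
    """Collapse replicate scans to one spectrum per sample.

    scans: iterable of (sample_id, spectrum) pairs, where every spectrum is a
    list of values on the same wavelength grid.
    """
    if stat not in ("mean", "median"):
        raise ValueError("stat must be 'mean' or 'median'")
    summarise = fmean if stat == "mean" else median
    grouped = defaultdict(list)
    for sample_id, spectrum in scans:
        grouped[sample_id].append(spectrum)
    # zip(*replicates) walks the replicate scans wavelength by wavelength.
    return {
        sample_id: [summarise(values) for values in zip(*replicates)]
        for sample_id, replicates in grouped.items()
    }

# Hypothetical scans: two replicates of plot_1 and a single scan of plot_2.
scans = [
    ("plot_1", [0.25, 0.50, 0.625]),
    ("plot_1", [0.75, 0.50, 0.875]),
    ("plot_2", [0.10, 0.20, 0.30]),
]
aggregated = aggregate_scans(scans)
print(aggregated["plot_1"])  # -> [0.5, 0.5, 0.75]
```

Aggregating after outlier filtering, as in the upload workflow described above, keeps a single bad scan from distorting the spectrum stored for its sample.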
Analysis
To initiate an analysis, the user must select a dataset using the Breedbase search wizard tool. Breedbase datasets can contain observation unit-level (plot-, plant-, or sample-level) trial metadata and phenotypic data from one or more trials. After navigating to the “Spectral Analysis” webpage under the “Analysis” tab in Breedbase, the user can select one of these datasets as input for model development.
Preprocessing, also known as pretreatment, is often used to increase the signal-to-noise ratio in Vis-NIR datasets (Rinnan, van den Berg, & Engelsen, 2009). The waves function DoPreprocessing() applies functions from the stats (R Core Team, 2020) and prospectr (Stevens & Ramirez-Lopez, 2014) packages for spectral preprocessing, giving the user the option to keep the raw data (default in Breedbase) or use any of 12 combinations of common preprocessing methods including standard normal variate (SNV; Barnes et al., 1989), first and second derivatives, and Savitzky-Golay polynomial smoothing (Savitzky & Golay, 1964). If preprocessing is selected for a given analysis, model performance statistics for both raw data and all preprocessing methods will be displayed in the model training workflow. The model corresponding to the preprocessing method resulting in the lowest root mean squared error (RMSE) is then stored in Breedbase.
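Two of the pretreatments named above are simple enough to sketch directly. The Python below is an illustration under stated assumptions, not the DoPreprocessing() code: SNV is written as row-wise centering and scaling by the sample standard deviation, and the first derivative is approximated by a plain difference between neighboring wavelengths. The function names and the toy spectrum are hypothetical.

```python
from statistics import mean, stdev

def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own
    mean and sample standard deviation."""
    centre = mean(spectrum)
    spread = stdev(spectrum)
    return [(value - centre) / spread for value in spectrum]

def first_difference(spectrum):
    """First derivative approximated by differences between neighbouring
    wavelengths; the result is one value shorter than the input."""
    return [right - left for left, right in zip(spectrum, spectrum[1:])]

# Hypothetical five-wavelength spectrum.
raw = [0.20, 0.24, 0.32, 0.44, 0.60]
# The same sample with an additive offset and multiplicative scatter applied.
scattered = [0.1 + 1.5 * value for value in raw]

snv_raw = snv(raw)
snv_scattered = snv(scattered)
d1 = first_difference(raw)

# After SNV the two versions of the sample coincide up to rounding error.
largest_gap = max(abs(a - b) for a, b in zip(snv_raw, snv_scattered))
print(largest_gap < 1e-9)  # -> True
```

The comparison shows why SNV is used: offset and scaling differences between scans of the same material are removed, so the calibration model sees the spectral shape rather than the scatter.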
During analysis, the user can select from six cross-validation (CV) schemes that represent scenarios common in plant breeding using the FormatCV() function in waves. These include CV1, CV2, CV0, and CV00 as described in depth by Jarquín et al. (2017), as well as random and stratified random sampling. For those four schemes from Jarquín et al., the user must choose specific trials from their input dataset based on genotype and environment relatedness. Guides for these selections are available in the waves and Breedbase manuals, and they are accessible as pop-ups through the spectral analysis webpage.
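The four scenarios from Jarquín et al. (2017) differ in which records may enter the training set when a set of genotypes is to be predicted in a target environment. The sketch below encodes one common reading of those scenarios in plain Python: CV2 predicts tested genotypes in tested environments, CV1 untested genotypes in tested environments, CV0 tested genotypes in an untested environment, and CV00 untested genotypes in an untested environment. The record layout and function name are hypothetical, and FormatCV() in waves may form its sets differently, for example by sampling within trials.

```python
def split_for_scheme(records, test_genotypes, test_env, scheme):
    """Return (train, test) record lists for one prediction scenario.

    The test set is always the held-out genotypes in the target environment.
    The scheme decides which of the remaining records may be used to train.
    """
    keep_in_training = {
        # Everything outside the test set is available.
        "CV2": lambda held_out_geno, target_env: True,
        # Test genotypes have never been observed anywhere.
        "CV1": lambda held_out_geno, target_env: not held_out_geno,
        # The target environment has never been observed.
        "CV0": lambda held_out_geno, target_env: not target_env,
        # Neither the test genotypes nor the target environment were observed.
        "CV00": lambda held_out_geno, target_env: not held_out_geno and not target_env,
    }[scheme]
    train, test = [], []
    for record in records:
        held_out_geno = record["genotype"] in test_genotypes
        target_env = record["env"] == test_env
        if held_out_geno and target_env:
            test.append(record)
        elif keep_in_training(held_out_geno, target_env):
            train.append(record)
    return train, test

# Hypothetical dataset: three genotypes observed in each of two trials.
records = [{"genotype": g, "env": e} for g in "ABC" for e in ("E1", "E2")]

train_sizes = {
    scheme: len(split_for_scheme(records, {"A"}, "E2", scheme)[0])
    for scheme in ("CV2", "CV1", "CV0", "CV00")
}
print(train_sizes)  # -> {'CV2': 5, 'CV1': 4, 'CV0': 3, 'CV00': 2}
```

The shrinking training sets show why the choice of trials matters: the schemes that withhold more information give a more conservative estimate of how a model will perform on new material or in new environments.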
Several common machine learning (ML) algorithms are available for calibration model development in Breedbase via waves. The TrainSpectralModel() function in waves performs hyperparameter tuning as applicable using these ML algorithms in combination with CV and train functions from the package caret (Kuhn, 2020). The current version of waves supports three ML algorithms: partial least squares regression (PLSR; Wold, 1982; Wold et al., 1984) is implemented using the pls package (Mevik et al., 2019), random forest regression (RF; Ho, 1995) is implemented with the randomForest package (Liaw & Wiener, 2002), and support vector machine regression (SVM; Vapnik, 2000) is from the kernlab package (Karatzoglou et al., 2004). The modular nature of this function allows for the future inclusion of other available ML approaches.
The TestModelPerformance() function in waves acts as a wrapper for the iterative testing of different selections of training and testing sets. It calls the DoPreprocessing(), FormatCV(), and TrainSpectralModel() functions and outputs model performance statistics for each iteration of set formation.
Output
After training, common model performance statistics are both displayed on a results webpage and made available for download in .csv format. These statistics are calculated by the TrainSpectralModel() function in waves using the spectacles package (Roudier, 2020).
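Three of the statistics listed in the abbreviations (RMSE, R2p, and RPD) can be written out from their usual textbook definitions, as sketched below in plain Python. This is an illustration only: the function names and toy values are hypothetical, RPD is taken here as the sample standard deviation of the observed values divided by RMSE, and the spectacles package may differ in details such as bias handling.

```python
import math
from statistics import mean, stdev

def rmse(observed, predicted):
    """Root mean squared error of prediction."""
    return math.sqrt(mean((p - o) ** 2 for o, p in zip(observed, predicted)))

def squared_pearson(observed, predicted):
    """R2p: squared Pearson correlation of predicted and observed values."""
    mean_o, mean_p = mean(observed), mean(predicted)
    cross = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    ss_o = sum((o - mean_o) ** 2 for o in observed)
    ss_p = sum((p - mean_p) ** 2 for p in predicted)
    return cross ** 2 / (ss_o * ss_p)

def rpd(observed, predicted):
    """Residual predictive deviation: spread of the reference values
    relative to the prediction error (assumed definition, see lead-in)."""
    return stdev(observed) / rmse(observed, predicted)

# Hypothetical test-set values, e.g. root dry matter content in percent.
observed = [30.0, 32.0, 35.0, 38.0, 40.0]
predicted = [31.0, 31.0, 36.0, 37.0, 41.0]

stats = {
    "RMSE": rmse(observed, predicted),
    "R2p": squared_pearson(observed, predicted),
    "RPD": rpd(observed, predicted),
}
print({name: round(value, 3) for name, value in stats.items()})
```

Every prediction in the toy data is off by exactly one unit, so RMSE is 1.0 and RPD reduces to the standard deviation of the observed values. RMSE is reported in the units of the trait, whereas R2p and RPD are unitless, which is why they are often reported together.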
Once a model has been trained, it can be stored for later use. This action calls the SaveModel() function from waves. Metadata regarding the training dataset and other parameters specified by the user upon training initialization are stored alongside the model object itself in the database. Phenotype predictions can be performed by pairing saved models with spectral datasets that were collected using the same spectrometer type. Predicted phenotypes are stored as such in the database and are tagged with an ontology term specifying that they are predicted and not directly measured. Metadata regarding the model used for prediction are stored alongside the predicted value in the database. Predicted phenotypes are then available for download or use in other Breedbase analysis tools such as the Selection Index and Genomic Selection to support decision making in the plant breeding process.
EXAMPLE DATASET AND PERFORMANCE TESTS
An example dataset of Vis-NIR and reference phenotypic data from Ikeogu et al. (2017) is included in the package waves. In their study, spectra were collected from freshly sliced and shredded cassava roots using a QualitySpec Trek Vis-NIR spectrometer and phenotypic reference data include root dry matter content (RDMC) using the oven method, as well as total carotenoid content (TCC) as measured by high performance liquid chromatography. When this dataset was analyzed using the TestModelPerformance() function of waves with the same training and test sets as Ikeogu et al. (2017), the predictive performance of our best PLSR models (Figure 3, Supplemental Table S2) was similar to their published best modified PLSR models as developed with commercial, proprietary software.
FIGURE 3. Distributions of R2p, the squared Pearson's correlation between predicted and observed for the test set, for partial least squares regression models of two root quality traits trained on samples from the C16Mcal dataset and tested on samples from the C16Mval dataset from Ikeogu et al. (2017) with raw data or after pretreatment. Commercial software performance using the same datasets as reported by Ikeogu et al. (2017) is displayed as horizontal dashed lines in corresponding colors for each trait. *SNV, standard normal variate; SNV1D, standard normal variate and first derivative; SNV2D, standard normal variate and second derivative; D1, first derivative; D2, second derivative; SG, Savitzky-Golay with window size = 11; SNVSG, standard normal variate and Savitzky-Golay; SGD1, gap segment derivative with window size = 11; SG.D1W5, Savitzky-Golay with window size = 5 and first derivative; SG.D1W11, Savitzky-Golay with window size = 11 and first derivative; SG.D2W5, Savitzky-Golay with window size = 5 and second derivative; SG.D2W11, Savitzky-Golay with window size = 11 and second derivative
For RDMC, the best model trained by waves generated an R2p of .86 using a Savitzky-Golay filter, whereas the best model in Ikeogu et al. (2017) was generated with an SNV detrend pretreatment and had an R2p of .84. Similarly, waves achieved an R2p of .95 for TCC with a gap derivative filter, and Ikeogu et al. (2017) generated an R2p of .86 using an SNV detrend filter with the same datasets.
CONCLUSION
The open-source R package waves provides a comprehensive spectral data analysis pipeline that is tailored to the unique needs of plant breeders and their breeding programs. Integration of this package with Breedbase provides users with a GUI for improved ease-of-use within a complete digital ecosystem. Ultimately, these tools will help to improve the turnaround time for routine non-destructive phenotyping analysis, facilitating more rapid critical decision making in plant breeding programs.
ACKNOWLEDGMENTS
We thank the NextGen Cassava project for their interest in and support of this work, especially Prasad Peteti and Afolabi Agbona for many fruitful discussions on ontology implementation. Thanks also to Gaby Mbanjo, Enoch Wembabazi, and Joshua Anderson for testing the pipeline. This work has been supported by USDA NIFA AFRI EWD Predoctoral Fellowship 2019-67011-29606 (J.H.), NSF BREAD IOS-1543958 (M.A.G.), the UK Foreign, Commonwealth and Development Office (L.A.M.), the Bill and Melinda Gates Foundation grant INV-007637 (L.A.M.), AfricaYam (L.A.M.), and GT4SP (L.A.M.). Program activities have also been funded by the U.S. Agency for International Development (USAID) under Cooperative Agreement no. 7200AA-19LE-00005.
AUTHOR CONTRIBUTIONS
J. Hershberger: Conceptualization; Software; Writing-original draft. N. Morales: Software; Writing-review & editing. C.C. Simoes: Software; Writing-review & editing. B. Ellerbrock: Conceptualization; Software; Writing-review & editing. G. Bauchet: Conceptualization; Writing-review & editing. L.A. Mueller: Conceptualization; Funding acquisition; Supervision; Writing-review & editing. M.A. Gore: Conceptualization; Funding acquisition; Supervision; Writing-review & editing.
CONFLICT OF INTERAST
None declared.
© 2021. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Visible and near-infrared spectroscopy (Vis-NIRS) is a promising tool for increasing phenotyping throughput in plant-breeding programs, but existing analysis software packages are not optimized for a breeding context. Additionally, commercial software options are often outside of budget constraints for some breeding and research programs. To that end, we developed an open-source R package, waves, for the streamlined analysis of spectral data with several cross-validation schemes to assess prediction accuracy. waves is compatible with a wide range of spectrometer models and performs visualization, filtering, aggregation, cross-validation set formation, model training, and prediction functions for the association of Vis-NIR spectra with reference measurements. Furthermore, we have integrated this package into the Breedbase family of open-source databases, expanding the analysis capabilities of this growing digital ecosystem to a number of crop species. Taken together, the standalone and Breedbase versions of waves enhance the accessibility of tools for the analysis of spectral data during the plant breeding process.