Making waves in Breedbase: An integrated spectral data storage and analysis pipeline for plant breeding programs

Abbreviations

CCC
Lin's concordance correlation coefficient

CV
cross-validation

R²_cv
coefficient of multiple determination of cross-validation

D1
first derivative

D2
second derivative

GD
gap derivative

GUI
graphical user interface

ML
machine learning

NIRS
near-infrared spectroscopy

PLSR
partial least squares regression

R²_p
squared Pearson's correlation of predicted and observed values in a test set

R²_sp
squared Spearman's correlation of predicted and observed values in a test set

RF
random forest

RMSE
root mean squared error

RPD
residual predictive deviation

RPIQ
ratio of performance to interquartile distance

SE
standard error

SG
Savitzky-Golay filter

SNV
standard normal variate

SVM
support vector machine

Vis
visible light.

INTRODUCTION

Visible and near-infrared spectroscopy (Vis-NIRS) allows for rapid and low-cost analysis of the physical and chemical makeup of biological samples in a non-destructive manner through the calibration of prediction models relating spectra to reference values. This is especially useful in the context of plant breeding programs, which often rely on expensive and time-consuming quality trait information for the development and release of improved plant varieties. In these cases, the prediction of quality traits with Vis-NIRS affords an opportunity to increase the efficiency of varietal development.

The use of NIRS to predict quality traits in agricultural products revolutionized the industry over 50 yr ago (McClure, 2003; dos Santos et al., 2013), but only with the recent application of microelectromechanical systems to spectroscopy has the cost of high-quality spectrometers reached a reasonable price point for most plant breeding and research programs (Crocombe, 2004; Pasquini, 2018; dos Santos et al., 2013). To evaluate and appropriately deploy these new tools in research and other high-value applications, prediction models must be developed for each phenotype of interest. Although many spectrometers come equipped with proprietary models that are pre-trained for phenotypes such as total protein content or brix, low-cost hardware often does not and instead requires the user to develop and maintain prediction models. Model calibration as a service and standalone commercial software are viable solutions, but the high cost of both negates the benefits of an inexpensive spectrometer.

Open-source software packages for the analysis of Vis-NIR spectral data are available in several programming languages, including R (R Core Team, 2020) and Python (Van Rossum & Drake, 2009), but thorough analysis, regardless of the language, requires functions to be systematically assembled from many different packages or libraries. In R, a user interested in spectral analysis must pull together preprocessing functions from prospectr (Stevens & Ramirez-Lopez, 2014) and model training functions from packages such as pls (Mevik, Wehrens, & Liland, 2019), randomForest (Liaw & Wiener, 2002), kernlab (Karatzoglou, Smola, Hornik, & Zeileis, 2004), and caret (Kuhn, 2020). Without a working knowledge of these programming languages and training in Vis-NIR reflectance analysis, the construction of a functional pipeline is unquestionably challenging. A web-based graphical user interface (GUI) would lower the entry point to spectral analysis by making a complete storage and analysis pipeline available to anyone with internet access, effectively democratizing Vis-NIRS when paired with low-cost spectrometers.

We developed an R package, waves, that brings essential spectral analysis functions together to enable streamlined filtering, preprocessing, model training, and phenotype prediction with cross-validation schemes tailored to address prediction scenarios commonly faced by plant breeding programs (Jarquín et al., 2017). To further leverage the functionality of waves in a database environment without a need to be familiar with the R programming language, we also created a waves-based spectral data analysis tool in Breedbase, an open-source family of relational databases for the storage and analysis of plant breeding genomic and phenomic data (https://breedbase.org; see Figure 1).

FIGURE 1. waves R package functions are integrated into the Breedbase spectral analysis tool for spectral data storage, preprocessing, cross-validation set formation, model development, model storage, and trait prediction

This integration also includes the development of spectral data storage and handling capabilities within the database, linking spectral JSONB tables with related field trial meta- and phenotypic data (Figure 2).

FIGURE 2. The Breedbase spectral analysis tool can be accessed using a web-based graphical user interface, shown here in the Cassavabase Breedbase instance

The Breedbase digital ecosystem allows users to store all available data across germplasm sets and environments, which can then be accessed to customize calibration models. To our knowledge, this is the first open-source integration of spectral data storage and analysis with plant breeding database tools.

waves is a standalone R package that is called by the backend Perl, JavaScript, and R code base of Breedbase (https://breedbase.org/). The user manual for Breedbase is available at https://solgenomics.github.io/sgn with information regarding the waves integration in the sections titled “Managing Spectral Data” and “Spectral Analysis,” and the user manual for the standalone R package waves version 0.1.0 is available in the supplemental files. The R package waves is available through the Comprehensive R Archive Network (https://CRAN.R-project.org/package=waves).

Core Ideas

waves is an open-source R package for spectral data analysis in plant breeding.
waves includes breeding-relevant cross-validation schemes to evaluate predictive accuracy of models.
Breedbase—an open-source database—was extended to support spectral data storage.
A new graphical user interface was developed for implementation of waves in Breedbase.

IMPLEMENTATION

Data storage

The storage of spectral data can be a major challenge for Vis-NIRS users, as high-dimensional datasets are common and must be manipulated in a controlled manner to correctly execute calibration procedures. Though proprietary software often offers both storage and analysis features, open-source resources are neither linked to phenotype data nor optimized for analytical purposes (Chalk, 2016; https://spectralworkbench.org). Breedbase stores spectral data alongside other types of phenotypic data, eliminating the need for data matching during analysis. Breedbase instances are governed by controlled ontologies based on the Crop Ontology system (Shrestha et al., 2012). We have extended the statistical ontology for this system to include spectroscopy analysis algorithm terms (see Supplemental Table S1), and these terms are combined with existing trait ontology terms to generate spectroscopy-based predicted ontology terms. Spectral data can be uploaded to Breedbase as .csv files. Future data transfer using BrAPI (Selby et al., 2019) will allow for interoperability with data collection software, another step toward a complete digital ecosystem for spectral data. Given the wide variety of Vis-NIR spectrometers employed by plant breeding programs, we made the system flexible enough to handle spectral data irrespective of the source spectrometer. Although our system does not handle proprietary data formats, once they are transformed into JSON or .csv, the number of wavelengths and the size of the gaps between them do not limit compatibility.

Spectral calibration models can be heavily affected by the presence of outliers, whether they are the result of spectrometer artifacts or user error. Mahalanobis distance (Mahalanobis, 1936) is a measure of the distance between a single observation and a larger distribution and is commonly used for the objective identification of outliers in a multivariate space, including spectral datasets (De Maesschalck et al., 2000). The FilterSpectra() function in waves identifies outliers by calculating the Mahalanobis distance of each observation in a given spectral matrix using the mahalanobis() function from the stats package (R Core Team, 2020). In Breedbase, this procedure is applied on a per-dataset basis on upload, and each outlier is given the binary tag “outlier.”
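The outlier screen described above can be illustrated with a minimal sketch. waves itself is written in R and relies on stats::mahalanobis(); the Python below is not the FilterSpectra() implementation, only an assumption-laden illustration of the same distance. The function names, the two-wavelength example data, and the cutoff of 3.0 are all hypothetical, and the covariance matrix must be invertible (more samples than wavelengths) for the sketch to run.

```python
from statistics import mean


def covariance_matrix(rows):
    """Return column means and the sample covariance matrix (n - 1 divisor)."""
    n, p = len(rows), len(rows[0])
    means = [mean(col) for col in zip(*rows)]
    cov = [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(p)] for i in range(p)]
    return means, cov


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def mahalanobis_sq(rows):
    """Squared Mahalanobis distance of every row from the column means."""
    means, cov = covariance_matrix(rows)
    distances = []
    for r in rows:
        d = [v - mu for v, mu in zip(r, means)]
        z = solve(cov, d)  # z = cov^-1 d without forming the inverse
        distances.append(sum(di * zi for di, zi in zip(d, z)))
    return distances


# Hypothetical reflectance values at two wavelengths for six scans;
# the last scan is deliberately aberrant.
spectra = [[0.10, 0.20], [0.12, 0.21], [0.11, 0.19],
           [0.09, 0.22], [0.13, 0.20], [0.50, 0.05]]
distances = mahalanobis_sq(spectra)
cutoff = 3.0  # hypothetical threshold chosen for this toy example only
flags = [d2 > cutoff for d2 in distances]
```

In practice the cutoff is usually derived from a reference distribution rather than fixed by hand; the value used here only serves to separate the aberrant scan in this small example.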

After outlier identification, data can be visualized using the PlotSpectra() function in waves. This function uses the filtered spectra and the ggplot() function from the ggplot2 package (Wickham, 2016) to create a line plot with outliers highlighted by line color. A list of rows identified as outliers is shown beneath the plot. In Breedbase, these plots are saved as .png files and linked to the original input datasets. Plot image files can be downloaded via the “Download Plot” button on the upload webpage.

To obtain a stable and reliable spectral profile, most spectrometer manufacturers recommend that multiple spectral scans be captured for each sample. Although some spectrometers aggregate these scans internally, many do not, requiring the user to do so before analysis can take place. Breedbase handles these cases upon data upload following filtering steps by calling the AggregateSpectra() function from waves, saving the aggregated scans for future access through the search wizard feature. After aggregation, the user exits the upload workflow, and the raw data file is saved and accessible through the Breedbase file storage system.
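As a rough illustration of scan aggregation, the sketch below averages replicate scans of the same sample wavelength by wavelength. It is written in Python for illustration and is not the AggregateSpectra() code; the function name, the sample identifiers, and the choice of the mean as the aggregation statistic are assumptions.

```python
from collections import defaultdict
from statistics import mean


def aggregate_scans(scans):
    """Average replicate scans per sample.

    scans: list of (sample_id, [reflectance at each wavelength]) tuples.
    Returns a dict mapping sample_id to its mean spectrum.
    """
    grouped = defaultdict(list)
    for sample_id, spectrum in scans:
        grouped[sample_id].append(spectrum)
    return {sample_id: [mean(column) for column in zip(*spectra)]
            for sample_id, spectra in grouped.items()}


# Hypothetical data: two scans of plot_1 and one scan of plot_2.
scans = [
    ("plot_1", [0.10, 0.20, 0.30]),
    ("plot_1", [0.20, 0.30, 0.40]),
    ("plot_2", [0.50, 0.60, 0.70]),
]
aggregated = aggregate_scans(scans)
```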

Analysis

To initiate an analysis, the user must select a dataset using the Breedbase search wizard tool. Breedbase datasets can contain observation unit-level (plot-, plant-, or sample-level) trial metadata and phenotypic data from one or more trials. After navigating to the “Spectral Analysis” webpage under the “Analysis” tab in Breedbase, the user can select one of these datasets as input for model development.

Preprocessing, also known as pretreatment, is often used to increase the signal-to-noise ratio in Vis-NIR datasets (Rinnan, van den Berg, & Engelsen, 2009). The waves function DoPreprocessing() applies functions from the stats (R Core Team, 2020) and prospectr (Stevens & Ramirez-Lopez, 2014) packages for spectral preprocessing, giving the user the option to keep the raw data (default in Breedbase) or use any of 12 combinations of common preprocessing methods including standard normal variate (SNV; Barnes et al., 1989), first and second derivatives, and Savitzky-Golay polynomial smoothing (Savitzky & Golay, 1964). If preprocessing is selected for a given analysis, model performance statistics for both raw data and all preprocessing methods will be displayed in the model training workflow. The model corresponding to the preprocessing method resulting in the lowest root mean squared error (RMSE) is then stored in Breedbase.
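Two of the pretreatments named above can be sketched briefly. The Python below is illustrative only and is not the DoPreprocessing() code: SNV is shown as centering and scaling each spectrum by its own mean and sample standard deviation, and the first derivative is approximated by a simple difference between adjacent wavelengths. The function names and reflectance values are hypothetical, and the difference-based derivative is an assumption that ignores wavelength spacing.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: center and scale one spectrum by its own statistics."""
    mu = mean(spectrum)
    sd = stdev(spectrum)  # sample standard deviation (n - 1 divisor)
    return [(value - mu) / sd for value in spectrum]


def first_difference(spectrum, lag=1):
    """Approximate a first derivative as the difference between points `lag` apart."""
    return [later - earlier for earlier, later in zip(spectrum, spectrum[lag:])]


# Hypothetical reflectance values at five consecutive wavelengths.
raw = [0.42, 0.45, 0.51, 0.60, 0.58]
scaled = snv(raw)                   # SNV only
snv_d1 = first_difference(scaled)   # SNV followed by a first derivative
```

Chaining the two functions mirrors the idea of combined pretreatments such as SNV followed by a first derivative; each derivative step shortens the spectrum by `lag` points.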

During analysis, the user can select from six cross-validation (CV) schemes that represent scenarios common in plant breeding using the FormatCV() function in waves. These include CV1, CV2, CV0, and CV00 as described in depth by Jarquín et al. (2017), as well as random and stratified random sampling. For the four schemes from Jarquín et al. (2017), the user must choose specific trials from their input dataset based on genotype and environment relatedness. Guides for these selections are available in the waves and Breedbase manuals, and they are accessible as pop-ups through the spectral analysis webpage.
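Of the six schemes, stratified random sampling is the simplest to sketch. The Python below is a hypothetical illustration, not the FormatCV() code: it assumes that strata are formed by ranking samples on their reference values and cutting the ranking into equal-sized bins, then drawing the same training fraction from each bin. The function name, parameters, and reference values are all assumptions.

```python
import random


def stratified_split(values, train_fraction=0.8, n_strata=4, seed=1):
    """Split sample indices into training and test sets within value-ranked strata."""
    rng = random.Random(seed)
    order = sorted(range(len(values)), key=lambda i: values[i])
    stratum_size = -(-len(order) // n_strata)  # ceiling division
    train, test = [], []
    for start in range(0, len(order), stratum_size):
        stratum = order[start:start + stratum_size]
        rng.shuffle(stratum)
        n_train = round(train_fraction * len(stratum))
        train.extend(stratum[:n_train])
        test.extend(stratum[n_train:])
    return sorted(train), sorted(test)


# Hypothetical reference values (e.g., a dry matter percentage) for 20 samples.
reference = [20.0 + 0.5 * i for i in range(20)]
train, test = stratified_split(reference, train_fraction=0.8, n_strata=4, seed=1)
```

Drawing from every stratum keeps the full range of the reference trait represented in both sets, which plain random sampling does not guarantee for small datasets.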

Several common machine learning (ML) algorithms are available for calibration model development in Breedbase via waves. The TrainSpectralModel() function in waves performs hyperparameter tuning as applicable using these ML algorithms in combination with CV and train functions from the package caret (Kuhn, 2020). The current version of waves supports three ML algorithms: partial least squares regression (PLSR; Wold, 1982; Wold et al., 1984) is implemented using the pls package (Mevik et al., 2019), random forest regression (RF; Ho, 1995) is implemented with the randomForest package (Liaw & Wiener, 2002), and support vector machine regression (SVM; Vapnik, 2000) is from the kernlab package (Karatzoglou et al., 2004). The modular nature of this function allows for the future inclusion of other available ML approaches.

The TestModelPerformance() function in waves acts as a wrapper for the iterative testing of different selections of training and testing sets. It calls the DoPreprocessing(), FormatCV(), and TrainSpectralModel() functions and outputs model performance statistics for each iteration of set formation.

Output

After training, common model performance statistics are both displayed on a results webpage and made available for download in .csv format. These statistics are calculated by the TrainSpectralModel() function in waves using the spectacles package (Roudier, 2020).
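Several of the statistics listed in the abbreviations can be computed directly from observed and predicted test-set values. The Python sketch below is illustrative and does not reproduce the spectacles or waves code. It assumes the commonly used definitions RPD = SD(observed)/RMSE and RPIQ = IQR(observed)/RMSE, and the quartile convention used for the interquartile range is also an assumption; the function names and example values are hypothetical.

```python
from math import sqrt
from statistics import mean, quantiles, stdev


def rmse(observed, predicted):
    """Root mean squared error of prediction."""
    return sqrt(mean([(o - p) ** 2 for o, p in zip(observed, predicted)]))


def squared_pearson(observed, predicted):
    """Squared Pearson correlation between observed and predicted values (R2_p)."""
    mo, mp = mean(observed), mean(predicted)
    sxy = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    sxx = sum((o - mo) ** 2 for o in observed)
    syy = sum((p - mp) ** 2 for p in predicted)
    return sxy ** 2 / (sxx * syy)


def performance(observed, predicted):
    """Collect RMSE, R2_p, RPD, and RPIQ for one test set."""
    error = rmse(observed, predicted)
    q1, _, q3 = quantiles(observed, n=4, method="inclusive")
    return {
        "RMSE": error,
        "R2p": squared_pearson(observed, predicted),
        "RPD": stdev(observed) / error,   # assumed definition: SD / RMSE
        "RPIQ": (q3 - q1) / error,        # assumed definition: IQR / RMSE
    }


# Hypothetical observed and predicted trait values for five test samples.
observed = [30.0, 32.0, 34.0, 36.0, 38.0]
predicted = [30.5, 31.5, 34.5, 35.5, 38.0]
stats = performance(observed, predicted)
```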

Once a model has been trained, it can be stored for later use. This action calls the SaveModel() function from waves. Metadata regarding the training dataset and other parameters specified by the user upon training initialization are stored alongside the model object itself in the database. Phenotype predictions can be performed by pairing saved models with spectral datasets that were collected using the same spectrometer type. Predicted phenotypes are stored as such in the database and are tagged with an ontology term specifying that they are predicted and not directly measured. Metadata regarding the model used for prediction are stored alongside the predicted value in the database. Predicted phenotypes are then available for download or use in other Breedbase analysis tools such as the Selection Index and Genomic Selection to support decision making in the plant breeding process.

EXAMPLE DATASET AND PERFORMANCE TESTS

An example dataset of Vis-NIR and reference phenotypic data from Ikeogu et al. (2017) is included in the package waves. In their study, spectra were collected from freshly sliced and shredded cassava roots using a QualitySpec Trek Vis-NIR spectrometer, and phenotypic reference data include root dry matter content (RDMC) measured using the oven method, as well as total carotenoid content (TCC) measured by high-performance liquid chromatography. When this dataset was analyzed using the TestModelPerformance() function of waves with the same training and test sets as Ikeogu et al. (2017), the predictive performance of our best PLSR models (Figure 3, Supplemental Table S2) was similar to their published best modified PLSR models as developed with commercial, proprietary software.

FIGURE 3. Distributions of R²_p, the squared Pearson's correlation between predicted and observed for the test set, for partial least squares regression models of two root quality traits trained on samples from the C16Mcal dataset and tested on samples from the C16Mval dataset from Ikeogu et al. (2017) with raw data or after pretreatment. Commercial software performance using the same datasets as reported by Ikeogu et al. (2017) is displayed as horizontal dashed lines in corresponding colors for each trait. *SNV, standard normal variate; SNV1D, standard normal variate and first derivative; SNV2D, standard normal variate and second derivative; D1, first derivative; D2, second derivative; SG, Savitzky-Golay with window size = 11; SNVSG, standard normal variate and Savitzky-Golay; SGD1, gap segment derivative with window size = 11; SG.D1W5, Savitzky-Golay with window size = 5 and first derivative; SG.D1W11, Savitzky-Golay with window size = 11 and first derivative; SG.D2W5, Savitzky-Golay with window size = 5 and second derivative; SG.D2W11, Savitzky-Golay with window size = 11 and second derivative

For RDMC, the best model trained by waves generated an R²_p of .86 using a Savitzky-Golay filter, whereas the best model in Ikeogu et al. (2017) was generated with an SNV detrend pretreatment and had an R²_p of .84. Similarly, waves achieved an R²_p of .95 for TCC with a gap derivative filter, and Ikeogu et al. (2017) generated an R²_p of .86 using an SNV detrend filter with the same datasets.

CONCLUSION

The open-source R package waves provides a comprehensive spectral data analysis pipeline that is tailored to the unique needs of plant breeders and their breeding programs. Integration of this package with Breedbase provides users with a GUI for improved ease-of-use within a complete digital ecosystem. Ultimately, these tools will help to improve the turnaround time for routine non-destructive phenotyping analysis, facilitating more rapid critical decision making in plant breeding programs.

ACKNOWLEDGMENTS

We thank the NextGen Cassava project for their interest in and support of this work, especially Prasad Peteti and Afolabi Agbona for many fruitful discussions on ontology implementation. Thanks also to Gaby Mbanjo, Enoch Wembabazi, and Joshua Anderson for testing the pipeline. This work has been supported by USDA NIFA AFRI EWD Predoctoral Fellowship 2019-67011-29606 (J.H.), NSF BREAD IOS-1543958 (M.A.G.), the UK Foreign, Commonwealth and Development Office (L.A.M.), the Bill and Melinda Gates Foundation grant INV-007637 (L.A.M.), AfricaYam (L.A.M.), and GT4SP (L.A.M.). Program activities have also been funded by the U.S. Agency for International Development (USAID) under Cooperative Agreement no. 7200AA-19LE-00005.

AUTHOR CONTRIBUTIONS

J. Hershberger: Conceptualization; Software; Writing-original draft. N. Morales: Software; Writing-review & editing. C.C. Simoes: Software; Writing-review & editing. B. Ellerbrock: Conceptualization; Software; Writing-review & editing. G. Bauchet: Conceptualization; Writing-review & editing. L.A. Mueller: Conceptualization; Funding acquisition; Supervision; Writing-review & editing. M.A. Gore: Conceptualization; Funding acquisition; Supervision; Writing-review & editing.

CONFLICT OF INTEREST

None declared.

© 2021. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Visible and near-infrared spectroscopy (Vis-NIRS) is a promising tool for increasing phenotyping throughput in plant-breeding programs, but existing analysis software packages are not optimized for a breeding context. Additionally, commercial software options are often outside of budget constraints for some breeding and research programs. To that end, we developed an open-source R package, waves, for the streamlined analysis of spectral data with several cross-validation schemes to assess prediction accuracy. waves is compatible with a wide range of spectrometer models and performs visualization, filtering, aggregation, cross-validation set formation, model training, and prediction functions for the association of Vis-NIR spectra with reference measurements. Furthermore, we have integrated this package into the Breedbase family of open-source databases, expanding the analysis capabilities of this growing digital ecosystem to a number of crop species. Taken together, the standalone and Breedbase versions of waves enhance the accessibility of tools for the analysis of spectral data during the plant breeding process.

Details

Title

Making waves in Breedbase: An integrated spectral data storage and analysis pipeline for plant breeding programs

Section

SCIENCE NOTES

Publication year

2021

Publication date

2021

Publisher

John Wiley & Sons, Inc.

e-ISSN

25782703

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1002/ppj2.20012

ProQuest document ID

2823081357
