Introduction
The need for high-quality soil data has grown exponentially in recent years. Monitoring soils for sustainable development is a high priority at this moment as a number of large existential environmental challenges have been linked to the role that soils play [1, 2]. Healthy soils are essential for the growth of crops, filtration of water, functioning of ecosystems, and storage of vast amounts of carbon, with this latter point being not only crucial for climate change mitigation potential but also for sustainable land stewardship, ultimately contributing to improved agricultural and environmental outcomes [3–5]. In turn, soil scientists have been struggling to meet this demand because measuring soil properties still relies largely on physical sampling and resource-intensive benchtop (wet-chemistry) analytical methods [6]. Similarly, many investigations and modeling of the global challenges of soil security still rely on limited soil expertise or legacy data, which is not compatible with urgent needs [7, 8]. Tackling the challenge of meeting society’s demands for soil information has been proposed with strong community-based research efforts and the incorporation of novel technologies, like diffuse reflectance spectroscopy (DRS) [6, 7]. DRS has been promoted as a technology of the 21st century for estimating soil properties [9]. The predictions from DRS are highly repeatable and reproducible, costing a fraction of the cost of a chemical analysis [10]. Due to these appealing characteristics, DRS is becoming an essential tool for measuring and monitoring soil by producing large amounts of quantitative information in a broad range of applications [9].
DRS, the measurement of light absorption in the visible and near-infrared (VisNIR; 350–2500 nm) or mid-infrared (MIR; 2500–25000 nm [4000–400 cm−1]) regions of the electromagnetic spectrum, has emerged as an important, rapid and low-cost complement to traditional wet chemical analyzes [9, 11, 12]. The proportion of incident radiation reflected by the soil that is detected through VisNIR-MIR reflectance spectroscopy can be used to estimate numerous soil properties using statistical and machine learning methods [13–15]. However, a bottleneck to the more widespread adoption of soil spectroscopy is the need for establishing large reference training data sets, which are called soil spectral libraries (SSLs) [9]. In fact, for dealing with complex relationships between soil components and the spectra, irrespective of the multivariate statistical or machine learning method (e.g., local or global modeling), the prediction capacity for new soil samples is still subject to the number (quantity) and diversity (quality) of soil types and conditions that are represented in the feature space (spectra), a limitation that is not exclusive to soil spectroscopy but rather common to any machine learning problem [16].
In recent years, there were some attempts to build a global soil spectral library based on collaborative contributions that could serve the soil science community, e.g. by Viscarra-Rossel et al. [14] and Demattê et al. [17]. More recently, the Global Soil Partnership of the Food and Agriculture Origination (FAO) of the United Nations and its Global Soil Laboratory Network initiative on soil spectroscopy (GLOSOLAN-Spec) started engaging the international soil science community to connect conventional soil testing laboratories to those having or interested in having spectroscopy capacity. An unprecedented effort to establish standards and protocols for soil spectroscopy has emerged as part of the Institute of Electrical and Electronics Engineers (IEEE) Standards Association (AS) P4005 working group. Nonetheless, a step towards better community integration that genuinely observes open-data principles was lacking at the moment. It is important to highlight that open data plays a crucial role in advancing scientific research and innovation [18, 19]. With this principle, new collaboration and research can be fostered, and new adopters can join the community to make soil spectroscopy stronger with a global and open soil spectral library. Furthermore, the open data principle allows others to verify, reproduce, and improve, building trust around the potential of DRS for soil properties estimation [6, 19].
To bridge this data gap and enable hundreds of stakeholders interested in soil information generation by leveraging a centralized open resource, the Soil Spectroscopy for Global Good (SS4GG) [20] initiative has created an open-source and open-data project called the “Open Soil Spectral Library” (OSSL) [21]. The SS4GG was designed to accelerate the pace of scientific discovery in soil spectroscopy by facilitating and supporting a collaborative network of researchers. With the OSSL, the SS4GG team seeks to develop advanced yet intuitive, open-source, web-hosted resources to predict various soil properties from spectra collected on any spectrometer anywhere in the world. This initiative brings together soil scientists, spectroscopists, pedometricians, data scientists, and others to overcome some of the current bottlenecks preventing wider and more efficient use of soil spectroscopy. A series of events and activities have been initiated to address topics including lab-to-lab interoperability [22], data harmonization, model development, community engagement & outreach, and the use of spectroscopy to inform environmental modeling [20].
This paper presents a detailed and technical description of the OSSL and the resources derived that are currently available to the soil spectroscopy community. The OSSL is based primarily on the work and ideas of Viscarra-Rossel et al. [14] and Shepherd et al. [6]; it can be considered a scale-up of the first global soil calibration library produced by ICRAF-ISRIC [23]. Here, we describe the methods for collecting and harmonizing several SSLs that are incorporated into the OSSL database. We then provide an exploratory data analysis of the combined data and report the accuracy of the fitted prediction models built on top of the OSSL database. All processing steps and the modeling inputs and outputs are fully documented on Github repositories [24] and the OSSL technical manual [21]. In this paper, we provide only key results from the accuracy assessment and discuss possible limitations and future development directions. The OSSL is an ongoing project, being periodically updated with new data and model re-calibration.
Materials and methods
Preparing the OSSL database
The OSSL is made up of several SSLs that present a data-sharing policy. The latest version (v1.2) includes the United States Department of Agriculture (USDA) Natural Resources Conservation Service (NRCS) National Cooperative Soil Survey (NCSS)—Kellogg Soil Survey Laboratory (KSSL) MIR [25–27] and VisNIR libraries, labeled KSSL, with the VisNIR dataset being sourced from the Rapid Carbon Assessment Project (RaCA) project from USDA-NRCS [28, 29]; World Agroforestry Centre (ICRAF) MIR and VisNIR libraries, developed in cooperation with the International Soil Reference and Information Centre (ISRIC) [30], labeled ICRAF-ISRIC; two MIR libraries of the Africa Soil Information Service (AfSIS) [31, 32], labeled as AFSIS1 and AFSIS2; the VisNIR library from the European Data Centre (ESDAC)—Joint Research Centre (JRC) as part of the Land Use/Cover Area frame statistical Survey (LUCAS soil of 2009, 2012, and 2015) [33], labeled LUCAS; the Central African MIR SSL developed by ETH Zurich in the Congo Basin [34], labeled CAF; a MIR library of New Zealand forest soils from researchers of Scion Research, NZ [35], labeled Garrett; a MIR library of boreal soils from researchers of the University of Zurich [36], labeled Schiedung; and a subset of the Serbian SSL from the University of Novi Sad [37], labeled Serbia.
The KSSL MIR dataset represents a snapshot from July 2022, as this large SSL keeps growing in size as more samples are scanned. Aliquots of around 600 samples from the LUCAS SSL were shipped to Woodwell Climate Research Center and scanned using an MIR device (Bruker Vertex 70 with a Pike Autodiff accessory). In terms of data licensing, KSSL, ICRAF-ISRIC, AFSIS2, CAF, Schiedung, and Garrett present a Creative Commons (CC) license, while LUCAS has a JRC license agreement with “(…) graphical representation of individual units on a map is permitted as long as the geographical locations of the soil samples are not detectable” (i.e., downgraded to about 1 km of precision), AFSIS1 has an Open Data License (ODbL), and Serbia was published in a paper and shared internally with the OSSL team.
In addition to the full VisNIR and MIR spectral ranges, we imported into the OSSL a library limited to a specific NIR range (1350–2550 nm) built using the NeoSpectra Handheld NIR Analyzer developed by Si-Ware. This library includes 2,106 distinct mineral soil samples scanned on 9 portable low-cost NIR spectrometers identified by serial number. The 2,016 soil samples were obtained from USDA NRCS NSSC-KSSL Soil Archives and 90 samples were selected across Ghana, Kenya, and Nigeria. These samples were scanned by the Woodwell Climate Research Center and the University of Nebraska—Lincoln, which were made available with a CC-BY 4.0 license on Zenodo [38]. The number of unique samples correctly imported into the OSSL per original source and type of spectra is described in Table 1. A diagram of the general OSSL workflow is provided in Fig 1.
[Figure omitted. See PDF.]
Compilation and harmonization of the Open Soil Spectral Library (OSSL) from original data producers, and modeling framework that includes model calibration, evaluation, additional outputs, and open resources.
[Figure omitted. See PDF.]
Harmonizing the SSLs
Different SSLs may cause prediction bias in external samples [39] for a number of reasons including different analytical reference methods and standard operating procedures (SOPs) [40], or differences in scanning conditions affected by the instruments themselves and the SOPs used for scanning [41]. Contrasting methods used to determine (i.e., wet chemistry) a given soil property have been a subject of ongoing discussion. Global efforts such as the Global Soil Laboratory Network (GLOSOLAN) [42] grapple with this same challenge in their soil databases, but there is still no clear consensus on how to harmonize those different analytical methods.
To maximize transparency, we have decided to produce two different levels for the analytical reference methods in the OSSL database. Level 0 takes into account the original methods employed in each data set, but initially tries to fit them to two reference guides: i) the KSSL Guidance—Laboratory Methods and Manuals [43]; ii) any method cataloged by ISO standard [44, 45]. If an analytical method does not fall into any previous methods and has a significant number of samples, then we create a new variable. The final harmonization takes place in OSSL level 1, where those common properties sharing different methods are converted to a target using some publicly available transformation rule, or in the worst scenario, they are naively bound or kept separated to maintain their specificity. The processing code is documented in the ossl-imports GitHub repository [24] allowing for community-based co-development.
For importing SSLs into the OSSL, we define a schema where soil site information, soil laboratory data (i.e., wet chemistry), and spectroscopic data are prepared as separate tables that share a common column ID for proper joining. For column names in soil laboratory data, we concatenate three different pieces of information separated by an underscore: abbreviation of the soil property name, analytical method code, and unit abbreviation. For example, plant available calcium (mg kg−1) using Mehlich 3 as the extractor, following the KSSL procedure, was named ca.ext_usda.a1059_mg.kg. Organic carbon (%) determined following ISO 10694 was formatted as oc_iso.10694_w.pct. This string pattern is followed in other files and column names with some minor adaptations where required. Reference tables with column names, abbreviations, units, data type, descriptions, and examples are provided in the ossl-imports GitHub repository [24]. A reference table containing a copy of the KSSL Guidance procedures is also provided in that repository. Lastly, the harmonization rules used to translate or combine the soil properties from level 0 to level 1 are also available there. Both levels are available in the OSSL database; however, only a selection of harmonized soil properties (level 1) containing at least 500 samples was used to calibrate the prediction models (Table 2). Summary statistics of the imported and harmonized soil properties, grouped by spectral region, are supplemented in S1 Table.
[Figure omitted. See PDF.]
For the spectral data (Fig 2), VisNIR spectra were formatted to reflectance factor per wavelength. The spectra imported into the OSSL were trimmed between 350 and 2500 nm, with an interval of 2 nm. MIR spectra were formatted to pseudo-absorbance units per wavenumber. The spectral range imported into OSSL falls between 600 and 4000 cm−1, with an interval of 2 cm−1. All data sets followed these specifications, and for some cases, we converted the reflectance (R) values into absorbance units (A) with A = log10(1/R). For measurements having a shorter or larger spacing interval, we used splines for re-sampling the spectral curves into the defined specifications [46]. The spectral variability of the three distinct spectral ranges is presented in S1–S3 Figs.
[Figure omitted. See PDF.]
Spectral variation of the mid-infrared (650–4000 cm−1, panels A and B), visible and near-infrared (350–2500 nm, panels C and D), and near-infrared from Neospectra (1350–2550 nm, panels E and F) imported into the Open Soil Spectral Library (OSSL). The shaded area depicts the maximum and minimum values of the spectral region. The black lines represent the mean of each original dataset imported into the OSSL. The left panels (A, C, and E) are the original spectra, while the right panels (B, D, and F) represent the spectra after preprocessing with Standard Normal Variate (SNV).
Modeling framework
We have used the MLR3 framework [47] to fit machine learning models. MLR3 is a package universe from R statistical programming language where the user can seamlessly define a machine learning pipeline for resampling, algorithm and model selection, optimization, evaluation, and other steps. In turn, for machine learning, Cubist (cubist acronym) [48, 49] was adopted as many studies have found it to be one of the top performing models for soil spectroscopy [22, 50, 51] and after some initial findings from exploratory benchmarks. We compared Cubist alone against a model ensemble composed of Cubist, Elastic Net, and Gradient Boosting Trees to minimize potential inductive bias introduced by model choice. However, we found that Cubist alone was yielding superior results without demanding too much computing time and complex operations, keeping our predictive modeling framework more parsimonious and robust [24]. Cubist is an ML algorithm that takes advantage of a decision-tree splitting method but fits linear regression models at each terminal leaf. It also uses a boosting mechanism (sequential trees adjusted by weight) that allows for the growth of a forest by tuning the number of committees. We have not used the correction of final predictions by the influence of nearest neighbors due to the lack of this feature in the MLR3 framework at the time of building OSSL.
The spectral dissimilarity across the original datasets that comprise the OSSL was handled by mathematical preprocessing with standard normal variate (SNV). This was adopted as per the findings of a ring trial experiment that was carried out as part of the SS4GG initiative [22] where we found that SNV is capable of reducing most of the variability caused by the contrasting instruments and scanning conditions (Fig 2), in addition to minimizing particle size, light scattering, and multicollinearity issues described by the inventor [52]. Similarly, recent studies have pointed out SNV as the best or comparably superior preprocessing algorithm for different spectral regions compared to other algorithms like moving window derivatives [53, 54]. Also, although spectral standardization may be necessary for very contrasting spectra found in heterogeneous libraries like the OSSL, this was not applied due to the absence of shared standard samples across the laboratories that developed the imported legacy libraries.
In terms of model types, we have fitted five different versions depending on the availability of samples in the database, relying solely on spectral variation to maximize applicability to different users (na acronym), i.e., site information or external environmental layers were not used as ancillary predictors, as they would require spatial and date information in the inputs. The model types were also composed of two different subsets, that is, using KSSL alone (kssl acronym), to eliminate potential errors due to analytical data being collected in different ways, or the full OSSL database (ossl acronym), in combination with three separate spectral types: VisNIR (visnir acronym), MIR (mir acronym), and NIR from the Neospectra instrument (nir.neospectra acronym). Only the nir.neospectra have a single model (ossl) as the differentiation would not be advantageous due to the limited sample size (Table 3).
[Figure omitted. See PDF.]
Hyperparameters were optimized through an internal (inner) re-sampling process using 5-fold cross-validation and smaller subsets (∼2000 samples) to speed up computing time [55]. This task was carried out with a grid search of the hyperparameter space testing up to five (5) Cubist configurations (commitees = [1, 5, 10, 15, 20]) to find the lowest root mean squared error (RMSE, Eq 4). The final model with the best optimal hyperparameters was fitted at the end with the full training data.
As predictors, we have used the first 120 principal components (PCs) of each spectral range and subset to maximize the balance i.e. trade-off between the spectral representation and compression magnitude employed by principal component analysis (PCA) [56, 57]. Each combination of models in Table 3) had a separate PCA model (Eq 1) fitted to compress the spectra of the training set and new samples. Additionally, as the first 120 components represent around 99% of the original variability, the remaining 1% was used to verify the extrapolation capacity of the calibration (see below), as this minor amount might contain specific absorption characteristics of unique soil samples that can make the calibration set unrepresentative and affect the predictive capacity.(1)where X is the original spectra that are decomposed into the PCA space with new coordinates denominated scores (T), estimated by the linear combination of loadings (P).
Considering that only the first 120 PCs are used to decompose X, the remaining higher-order components are denominated error (E), which is calculated by the difference between X and . This reasoning was employed in the OSSL models to identify potential underrepresented or outlier samples relative to the calibration set T120. In this task, each new spectrum (xi) to be predicted is back-transformed using only the first 120 PCs (). With the back-transformed version (), the residual of each sample () is used to calculate the Q statistic (Eq 2). The Qi statistic is compared against a critical value (upper confidence limit, UCLQ) estimated across the calibration set with a 99% confidence level (Eq 3) for identifying potential underrepresented samples when Qi > UCLQ [56, 58, 59].(2) (3)where θ1 is the trace (sum of the values on the main diagonal) of the covariance of E, defined as Cov(E), θ2 and θ3 are the trace of Cov(E)2 and Cov(E)3, respectively, , and zα is the standardized normal variable with (1 − α) confidence level (set 99%).
Except for clay, silt, sand, pH H2O, and pH CaCl2, all other soil properties were natural-log transformed (with offset = 1, log1p() R function) to improve the prediction performance of soil properties with highly skewed distribution [50]. They were back-transformed (expm1() R function) only at the end after running all the modeling steps, including performance estimation and definition of the uncertainty intervals.
The evaluation of the models was performed with a 10–fold external cross-validation (outer) of the tuned models using RMSE (Eq 4), mean error (bias, Eq 5), R-squared (R2, Eq 6), Lin’s concordance correlation coefficient (CCC, Eq 7), and the ratio of performance to the interquartile range (RPIQ, Eq 8) [60]:(4)(5)(6)(7)(8)where yi is the observed value, is the predicted value, is the mean value, n is the total number of samples, r is the Pearson correlation between y and , is the variance of observed values, is the variance of predicted values, is the mean of predicted values, and Q3 − Q1 is the interquartile distance of observed values.
Cross-validated predictions were also used to estimate the unbiased absolute error (), which is further used to calibrate the uncertainty of the estimation models via conformal prediction [61]. The conformal prediction is based on past experience, that is the error model (δ = f(T120)) is calibrated using the same fine-tuned structure of the respective response model (y = f(T120)). Non-conformity scores (α) were estimated from the predicted calibration error with a confidence level of 68% (ϵ) to approximate one standard deviation (Eq 9). Non-conformity scores from the calibration set are sorted in increasing order to identify the percentile corresponding to the predefined confidence level (αϵ). Finally, for new samples, the prediction interval (PI) can be derived from the predicted response () and the predicted error () corrected by the non-conformity score of the calibration set (Eq 10) [62]. The main advantage of conformal prediction is the coverage of the intervals; that is, a confidence of 80% means that the predicted range may contain the observed value for around or more than 80% of the cases [61]. Another big advantage of conformal prediction methods is that we get rid of re-sampling mechanisms such as bootstrapping or Monte Carlo simulation, which require a huge computation cost for large sample sizes and may not have good coverage of the true estimates. The non-conformity score and prediction interval are derived using:(9) (10)
For a final independent and external evaluation of the OSSL models, we have tested several datasets depending on the spectral range of interest. In this complementary evaluation, partial least squares regression (PLSR) models were also fitted and optimized with 10-fold cross-validation (up to 30 components) for benchmarking performance, as this algorithm is a standard choice in the chemometrics field [63]. For the MIR range, the first test data set is composed of 70 samples scanned by many instruments (n = 20), which were obtained as part of the ring trial experiment developed by the SS4GG initiative [22]. This data set helps us to understand the robustness of OSSL models against small spectral variations resulting from dissimilarities across instruments and SOPs. Predictions models of oc_usda.c729_w.pct, clay.tot_usda.a334_w.pct, ph.h2o_usda.a268_index, and k.ext_usda.a725_cmolc.kg were investigated.
The second data set for MIR evaluation was compiled from long-term research (LTR) trial sites found throughout the US, spanning different locations and treatments such as cover crops, straw removal, fertilization, and irrigation [64]. This data set is especially attractive because it emulates a scenario of OSSL application and its effects for monitoring organic carbon changes across agricultural soils. This data set is publicly available with two spectral versions (KSSL and Woodwell, n = 162), where only oc_usda.c729_w.pct has reference analytical data.
For the VisNIR range, we also used the ring trial samples, although with a smaller sample size (n = 60) because only this amount was prepared as fine earth (soil particle size <2 mm) and shipped around the world. The same soil properties were evaluated using VisNIR, i.e., oc_usda.c729_w.pct, clay.tot_usda.a334_w.pct, ph.h2o_usda.a268_index, and k.ext_usda.a725_cmolc.kg. Lastly, the NIR Neospectra prediction models were tested with 90 soil samples from Ghana, Kenya, and Nigeria available through the public Neospectra database [38]. These samples were not used during model calibration but rather kept separated for external validation. Prediction models of oc_usda.c729_w.pct, clay.tot_usda.a334_w.pct, ph.h2o_usda.a268_index, and k.ext_usda.a725_cmolc.kg were tested.
Results
OSSL resources
The OSSL has compiled and harmonized the spectra from eleven different SSLs (Table 1). This was an enormous effort to make soil spectroscopy data more accessible by sharing a single endpoint, where everyone can access and fetch global data specific to their needs. The OSSL Manual describes different methods for getting data [21], which includes compressed .CSV and serialized files (.qs format) hosted on Google Cloud Storage, data collections imported to a NoSQL MongoDB, and several application programming interfaces (API) endpoints for filtering and selecting data from the database. Similarly, a graphical user interface (GUI) denominated OSSL Explorer [65] was also designed for exploring the database; however, it contains only the data with available spatial coordinates, not the full database. In fact, from more than 135,000 entries in the OSSL, 87,707 samples have spatial coordinates linked to at least one spectral region and one soil property of interest, as described in Table 2. This explains why some geographical regions are more represented than others in specific spectral regions (Fig 3), e.g., the VisNIR range is more represented and concentrated over the European countries. Nonetheless, for those samples missing precise coordinates in the OSSL, county centroid coordinates were provided especially for the VisNIR and MIR spectra from the KSSL. In any case, the references of the original data producers are provided in the OSSL Manual if a user is interested in getting data with differing criteria. It is also important to mention that another available resource is the OSSL Engine [66], an estimation service where users can upload spectra collected on their instruments and get back predictions with the response, uncertainty, and representation flag.
[Figure omitted. See PDF.]
A: Locations with visible–near-infrared (VisNIR) spectra (350–2500 nm). B: Locations with mid-infrared (MIR) spectra (600–4000 cm−1). Note: not all OSSL samples have precise location data. From a total of 135,651 entries in the database, only 87,707 entries with varying complete information have precise location data. VisNIR samples in the US are plotted using county centroid coordinates.
10-fold cross-validation model performance
When OSSL was used to calibrate prediction models, the results for internal performance evaluation using 10-fold cross-validation (10CV) with refitting were highly variable depending on the soil property, the spectral region, and the subset of calibration (S2 Table). It is important to mention that we fitted separate models for the KSSL library to ensure that no systematic bias would be propagated to predictions due to the different analytical procedures used for defining the reference soil property values. This version of the model helped us to understand the effects of contrasting analytical methods on predictions and uncertainty, even after applying harmonization rules, when evaluating the larger models calibrated with the whole OSSL database.
As more than 140 models were fitted with the full OSSL database, we summarized internal evaluation performance by plotting Lin’s CCC against RPIQ values (Fig 4). This plot shows the overall capacity of the models when linking the spectral variations to the original range of soil properties. Higher RPIQ values can be achieved either by having more broadly-distributed reference values (i.e., higher IQR) or by having more precise models, when the original calibration set is limited in representation (i.e., lower IQR). Both maximization, higher IQR or lower RMSE, results in higher RPIQ. Therefore, a cut-off of 2 means that the original variability measured by IQR is twice as large as the average model error (precision). It does not account for high bias (model inaccuracy), i.e., a potential mismatch caused by a shift of the predicted values even when getting precise models; thus Lin’s CCC enables us to account for this issue because it encompasses both the precision and accuracy of a model in its calculation.
[Figure omitted. See PDF.]
Model performance among spectral ranges: Visible–near-infrared (350–2500 nm, visnir), mid-infrared (600–4000 cm−1, mir), and near-infrared (1350–2550 nm, nir.neospectra) specific to the Neospectra scanner (Si-Ware). Symbols reflect the models calibrated using the full OSSL database (ossl) or the Kellogg Soil Survey Laboratory alone (kssl). Note: Dashed lines are centered at Lin’s CCC of 0.7 and RPIQ of 2.
Setting thresholds for Lin’s CCC of 0.7 and RPIQ of 2, which convey both performance and extrapolation capacity, we observe that most of the models calibrated with the KSSL dataset were placed in the best performance quadrant (top right) (Fig 4). Similarly, the MIR range appears to be the spectral region with the best association with the variation in the original soil properties, followed by VisNIR and NIR (Neospectra), with NIR (Neospectra) having more limited models (bottom left quadrant, Fig 4). This classification indicates that it is important to use multiple goodness-of-fit metrics to account for the distinct types of errors (bias and variance) associated with the training capability (quantity and quality of samples) when assessing model performance.
Nonetheless, one can have an overview of the OSSL model’s calibration capacity ranked by soil property (Fig 5). In this visualization, all the combinations of spectral regions for the whole OSSL are ranked in decreasing order by Lin’s CCC and striped when RPIQ ≤ 2. Overall, MIR had most of the soil properties’ models with Lin’s CCC > 0.7 and RPIQ > 2 when compared to VisNIR and Neospectra NIR. MIR models that did not pass this threshold classification were total sulfur (s.tot_usda.a624_w.pct), electrical conductivity (ec_usda.a364_ds.m), boron (Mehlich3 method, b.ext_mel3_mg.kg), sodium (both NH4OAc method na.ext_usda.a726_cmolc.kg and Mehlich3 method na.ext_usda.a1068_mg.kg), zinc (Mehlich3 method, zn.ext_usda.a1073_mg.kg), and available water content (awc.33.1500kPa_usda.c80_w.frac). For the VisNIR region, inconsistent calibration was found for bulk density (bd_usda.a4_g.cm3), coarse fragments (cf_usda.c236_w.pct), electrical conductivity (ec_usda.a364_ds.m), sodium (NH4OAc method na.ext_usda.a726_cmolc.kg), phosphorus (Olsen method, p.ext_usda.a274_mg.kg), total sulfur (s.tot_usda.a624_w.pct), water retention at 1500 kPa (wr.1500kPa_usda.a417_w.pct), and water retention at 33 kPa (wr.33kPa_usda.a415_w.pct). Lastly, for the Neospectra NIR models, calibration challenges were found for 12 models: Total aluminum (Crystalline form, al.dith_usda.a65_w.pct), total aluminum (Amorphous form, al.ox_usda.a59_w.pct), available water content (awc.33.1500kPa_usda.c80_w.frac), bulk density (bd_usda.a4_g.cm3), electrical conductivity (ec_usda.a364_ds.m), total iron (Amorphous form, fe.ox_usda.a60_w.pct), potassium (NH4OAc method, k.ext_usda.a725_cmolc.kg), manganese (KCl method, mn.ext_usda.a70_mg.kg), sodium (NH4OAc method, na.ext_usda.a726_cmolc.kg), phosphorus (Mehlich3 method, p.ext_usda.a1070_mg.kg), total sulfur (s.tot_usda.a624_w.pct), and water retention at 33 kPa (wr.33kPa_usda.a415_w.pct). This indicates that some regions have more limitations than others when only the spectral variations are used as predictors in the models. It is also clear that for many chemical soil properties, the calibration accuracy drops significantly from MIR to VisNIR and NIR instruments.
[Figure omitted. See PDF.]
Lin’s concordance correlation coefficient (CCC) from 10-fold cross-validation with refitting. Striped bars represent models with a ratio of performance to the interquartile range (RPIQ) < 2.
Users interested in the graphical analysis of a specific combination of soil property, spectral region of interest, and calibration library (either KSSL or the full OSSL), can have access to hex-binned scatter plots produced during model evaluation (Fig 6). This type of visualization is included in the ossl-models GitHub repository [24] and readily available via the OSSL Engine [66] when selecting a model of interest.
[Figure omitted. See PDF.]
: A, B, and C are the accuracy plots of oc_usda.c729_w.pct from MIR, VisNIR, and Neospectra NIR models, respectively; D, E, and F are the accuracy plots for ph.h2o_usda.a268_index, respectively. All models have been fitted using Cubist. Accuracy results are based on 10–fold cross-validation with refitting.
Independent evaluation
Using the VisNIR ring trial test set for the evaluation of OSSL models revealed a big variation caused by the origin of the spectra (Table 4). When we calculate the average (mean) for each model type, PLSR achieves a higher Lin’s CCC (0.67) and RPIQ (1.15) than Cubist (0.60 and 1.05, respectively). The performance edge from PLSR seems to be consistent across soil properties and the ossl subset, especially for ph.h2o_usda.a268_index (PLSR average Lin’s CCC of 0.74 > Cubist average Lin’s CCC of 0.65) and k.ext_usda.a725_cmolc.kg (PLSR average Lin’s CCC of 0.47 > Cubist average Lin’s CCC of 0.28). For other soil properties, the results were less variable. In fact, considering that the VisNIR ring trial is composed of twelve different instruments that measured the same soil samples, most of the variation is due to the differences among them, laboratory conditions, and procedures. For example, clay.tot_usda.a334_w.pct, k.ext_usda.a725_cmolc.kg, and ph.h2o_usda.a268_index were the most sensitive soil properties to the test spectral composition (see the minimum statistics of Lin’s CCC and RPIQ), but PLSR found a way to deliver slightly better predictions for those contrasting setups. Although the overall results were not satisfactory in some cases, the independent evaluation showed that some instruments have a spectral composition similar enough to the OSSL library and can be used with the OSSL Cubist models (hosted in the OSSL Engine) with satisfactory performance for oc_usda.c729_w.pct, ph.h2o_usda.a268_index, and clay.tot_usda.a334_w.pct, with superior performance found when using the KSSL subset. Nonetheless, even with a good indication of performance on these test instruments, new samples to be predicted must be properly represented by the calibration feature space to avoid prediction bias.
[Figure omitted. See PDF.]
Statistics are reported for a sample of 12 distinct VisNIR instruments.
In turn, the MIR ring trial test set employed for the evaluation of OSSL models revealed differing results that favor the use of Cubist models (Table 5). The average (mean) performance of the Cubist models was higher (Lin’s CCC of 0.79 and RPIQ of 2.19) than the PSLR models (Lin’s CCC of 0.76 and RPIQ of 1.87). Similarly, we found that models calibrated with the whole OSSL are, on average, slightly superior (Lin’s CCC of 0.78 and RPIQ of 2.11) than those fitted with the KSSL alone (Lin’s CCC of 0.77 and RPIQ of 1.96). Considering these favorable findings towards the use of Cubist in combination with the whole OSSL MIR spectra, we found that c_usda.c729_w.pct (Lin’s CCC and RPIQ of 0.95 and 2.96, respectively), clay.tot_usda.a334_w.pct (0.74 and 1.81), and ph.h2o_usda.a268_index (0.0.84 and 2.92) achieved satisfactory overall performance regardless of the instrument type. For k.ext_usda.a725_cmolc.kg, the performance was more limited (Lin’s CCC and RPIQ of 0.61 and 1.48, respectively). In any case, some instruments are still subject to unsatisfactory results, as they seem to have unusual spectral variations that cause poor performance.
[Figure omitted. See PDF.]
Statistics are reported for a sample of 20 distinct MIR instruments. Metrics provided in original units after back-transformation.
The previous findings were also confirmed by another evaluation (Fig 7). The combination of the Cubist + OSSL subset resulted in a lower error (RMSE) for the estimation of organic carbon for agricultural experimental sites, regardless of the origin of the spectra (KSSL or Woodwell spectra). The KSSL subset was only comparable when the test spectra were collected using the same instrument. In addition, the PLSR RMSE was always higher than the Cubist for all cases. Additional metrics for MIR data from LTR sites are provided in S3 Table.
[Figure omitted. See PDF.]
ossl: models calibrated with the full OSSL. kssl: models calibrated with the Kellog Soil Survey Laboratory data alone. Note: Two different spectra versions are tested (n = 162), i.e. scanned both at Woodwell Climate and KSSL.
Lastly, the results of the independent evaluation of the OSSL models for the NIR (Neospectra) range using the African soil samples (Table 6) revealed that Cubist models are again generally better (mean Lin’s CCC and RPIQ of 0.55 and 1.40, respectively) than the baseline models fitted with PLSR (mean Lin’s CCC and RPIQ of 0.49 and 1.22, respectively). For the Cubist models, the best performance was found for oc_usda.c729_w.pct (Lin’s CCC and RPIQ of 0.85 and 1.53, respectively), while k.ext_usda.a725_cmolc.kg only achieved Lin’s CCC and RPIQ of 0.43 and 1.01, respectively. In fact, PLSR was only superior for this soil property. The results of the Neospectra independent validation show that, for this limited spectral range, the performance is inferior to other spectral ranges (VisNIR and MIR) and more generally useful only for a limited number of soil properties. In addition, the contrasting geographical origin between the training and test samples also may pose limitations on the true verification of the Neospectra models. When they were tested against the calibration feature space of the OSSL (US-only samples), all the African soil samples were flagged as represented by the spectral variation (based on the Q-statistics method).
[Figure omitted. See PDF.]
Discussion
Summary findings
The OSSL is formed by different data sets with variable numbers of soil properties and spectral ranges. This resulted in the calibration of 141 models with varying internal and external performance (Fig 4). However, it is essential to consider that for the first time, a global spectral library is freely available under an open data license that covers nearly all continents. This geographical representation is unprecedented since the first use of a spectrometer to quantify soil properties [67]. Each model was fitted on the basis of the specificity of the spectral range, spectrometer’s manufacturer (i.e., the NIR Neospectra models), two different calibration libraries (KSSL alone or full OSSL), wet chemistry methodology (some chemical soil properties have distinct methods available), and so on. Therefore, users can select the configuration that best fits their needs. While preconfigured models are provided, more advanced users can access the data and build any type of model that will suit their needs.
On average, the MIR range appears to be the best spectral region for developing spectral prediction models, followed by VisNIR and NIR (Neospectra), in which the latter can provide good external performance for oc_usda.c729_w.pct (i.e., soil organic carbon) and clay.tot_usda.a334_w.pct (i.e., clay content) given its limited spectral coverage and cost. This happens because the MIR contains several fundamental and resolved absorption features from mineral and organic functional groups that translate to better prediction capacity, despite challenges in the interpretation that stem from chemical heterogeneity [68]. VisNIR and NIR spectra, in turn, are made of overtones from the fundamental vibrations of the MIR range, hence, are less sensitive to soil constituents and may result in inferior performance. In addition, we found good performance for some soil properties that may not directly affect soil spectra but can be indirectly inferred and quantified (secondary properties), such as cation exchange capacity, pH, soil contaminants, etc. However, understanding both primary and secondary components in the soil helps to better understand the factors that contribute to the improvement of spectral predictive models within the complex context of soil systems and also to select the spectral range, where they are most pronounced. On average, the performance results of the OSSL models are consistent with findings documented in the literature [15, 17, 69, 70], where MIR-based models are significantly more accurate than VisNIR or NIR models.
We also found that the external performance (independent evaluation) is highly dependent on the characteristics of the test spectra, like spectral composition influenced by the instrument and standard operating procedures (SOPs). Using the experimental data from a ring trial [22], we found that VisNIR was more sensitive to instrument dissimilarity, with PLSR being the best-performing prediction algorithm. For the MIR and NIR (Neospectra) ranges, the Cubist models showed superior performance and were favored by the integration of several original SSLs that composed the OSSL. Nevertheless, we cannot exclude the effect of the origin and quality of the original datasets that make up the VisNIR calibration set. This yields variable results and the use of good SOPs with quality assurance and quality control (QA/QC) is an essential part of soil spectroscopy, regardless of the spectral range measured [9, 22].
Reliability of predictions
Building trust in predictive modeling requires the provision of complementary information, as the predictions might still be biased and/or with high variance even from satisfactory calibrated models. Going beyond the standard evaluation of models using goodness-of-fit metrics, the OSSL provides the precision of each prediction using the conformal prediction method (Fig 8). Different model types (Cubist and PLSR) resulted in different prediction intervals for soil organic carbon (SOC, oc_usda.c729_w.pct) when testing the LTR soil samples. Cubist models had a median width interval of 0.6 and 0.88, depending on the origin of the spectra, while PLSR resulted in broader median widths of 1.54 and 1.83. The assessment of all combinations (model type and spectra source) indicated coverage of the true values around or higher than the confidence level (95%, as the error probability was set to 5%), i.e., no model failed to translate its accuracy and precision to the predicted intervals. However, Cubist was more consistent than PLSR when combined with conformal prediction, as it resulted in the best goodness-of-fit model metrics coupled with a more approximate coverage to the predefined error and narrower uncertainty intervals.
[Figure omitted. See PDF.]
The left and right panels depict the uncertainty of Cubist and Partial Least Square Regression (PLSR) models, respectively. The top and bottom panels depict the uncertainty from the long-term research (LTR) sites scanned both in the Kellog Soil Survey Laboratory and Woodwell Climate, respectively. Note: all metrics are provided in the original units after backtransformation. The median width and coverage of the prediction intervals (PI95%) are provided for the uncertainty assessment.
Another element of the reliability assessment is the screening of potential outliers or underrepresented samples regarding the calibration feature space. This mechanism may help to understand the capacity of a model to extrapolate by simultaneously avoiding high bias on new predictions. There are different ways for this purpose but the soil spectroscopy literature has been promoting the use of process control charts that rely upon one or multiple quantitative methods [56, 71]. Q-statistics is routinely employed in control charts and has the advantage of leveraging compression or orthogonalization algorithms (like PCA and PLS) used in predictive modeling. However, the list of methods for process control is long, including Hotelling T2 test, F-ratio, SIMCA, etc [50, 71]. Further research still needs to be carried out especially because most of these methods rely on assumptions like the multivariate normal distribution and may be sensitive to heterogeneous soil spectral libraries, like the OSSL. In addition, the current orthogonalization implemented in the OSSL considers the full spectral space rather than the subset of every single combination of a soil property with the spectral region of interest, potentially yielding suboptimal results.
In the OSSL, Q-statistics was implemented by comparing the difference between the compressed back-transformed spectra and its original version with a critical value from the training set (Fig 9). In an illustration example, cropland soils from Massachusetts, USA, were flagged as underrepresented likely because they have distinct spectral patterns than the Neospectra calibration dataset. Not all soil types and geographical regions are properly represented in the calibration samples as it is a smaller representation of the KSSL soil survey database, so the Q-statistics was able to flag them based solely on spectral representation. This indicates that new soil sampling on underrepresented places or the design of new projects to specific needs (e.g., organic carbon change monitoring) will still be crucial to leverage the full potential of spectral prediction models without introducing prediction bias [8].
[Figure omitted. See PDF.]
In this example, samples from farmland soils (Massachusetts, USA) were evaluated against the NIR Neospectra OSSL model. All the test samples fell in the edge of the first two PCA spaces and were flagged as underrepresented regarding the full spectral feature space from the OSSL database. Note: The Q-statistics method uses all the retained principal components (in our study, 120 components) for flagging underrepresented spectra, not only the first two components depicted in the illustration. Figure sourced from the OSSL Engine [66].
An open and reproducible science initiative
The OSSL is a genuine open science project, meaning that it is a reproducible project aimed at the early sharing of soil data, research, and technological developments including open data, open standards, open source software, and open communication [72]. In fact, one of the main pillars of SS4GG is to support open innovation, so that all relevant knowledge actors are included and involved in the project by posting most of the input data sets, codes, and outputs on Github and similar repositories. Thus, all future exploitable activities of research outputs can be considered Findable, Accessible, Interoperable, and Reusable (FAIR). In this sense, OSSL can be considered as an open source driver for soil spectroscopy comparable to the widely used solutions for geospatial data, for example, the Geospatial Data Abstraction Library (GDAL) [73] and the Global Biodiversity Information Facility (GBIF) [74], to mention just a few projects that have especially inspired us.
There are limitations to fully open research especially regarding data privacy. One could argue what is more important for the global good: (A) the privacy and data ownership rights of farmers/land owners, or (B) the common interest of the public. Without proposing that either of the two is more important, we (the authors in this article) in fact support both paths, as the world’s soils are currently endangered [75, 76] and we need all possible efforts, open-source and commercial, to help conserve, regenerate, and improve soils in landscapes. In that sense, OSSL could potentially also host a number of proprietary datasets, provided that at least the derived models can be served as open data (Fig 10). However, we hope that most of the legacy datasets will eventually be released under an open data license and be included in the OSSL, especially those that have already been funded by public money through taxpayers.
[Figure omitted. See PDF.]
Many datasets can be included in OSSL; however, some minimum license compatibility is needed. OSSL accepts a diversity of soil spectral data sets as long as at least the produced models (model parameters) are shared under an open data license.
Limitations and future development
Most of the soil variability driven by physical, chemical, and biological characteristics is represented by distinct features present in the spectra [15]. This is one of the factors that makes soil spectroscopy so compelling. However, some properties have strong signatures, others are only indirectly related. Similarly, external factors such as geographical characteristics (slope, rainfall regime, etc.) and soil management practices (fertilization, tillage, etc.) can cause variations to soil properties and either change the spectral characteristics or remain undetectable, affecting the prediction capacity [77]. This additional information can either be integrated as ancillary features into the prediction models (discussed further) or be incorporated as metadata if new samples from distinct locations and spanning different conditions are added to the OSSL. In fact, the geographical representation of OSSL points is biased, especially towards Europe and the USA (Fig 3). However, the OSSL compilation was an enormous amount of work and can still accommodate new data, especially from underrepresented locations and soil conditions. Despite this fact, several studies have explored the OSSL in global analysis [78, 79], which demonstrates the power of open data in leveraging new research [19].
We envision that soil spectroscopy in the near future will become more accessible, more affordable, and better integrated into the measurement, monitoring, reporting, and verification (MMRV) of soil status and ecosystem services. This is simply because soil monitoring is a growing market, and soil spectroscopy will inevitably play a role in that growth. The following three aspects of the development of soil spectroscopy seem to be especially promising: (1) miniaturization of field-scale spectrometers, (2) effective international integration for building global soil spectral libraries and standard operating procedures, and (3) continued research and development of learning algorithms for improving predictive capacity.
The miniaturization and cost reduction of handheld field spectrometers, especially in the NIR and MIR range, can revolutionize soil spectroscopy adoption and increase the number of applications multifold [80]. Several interesting handheld or portable spectral scanners are used today to generate thousands of soil scans, e.g. Neospectra™ [81, 82] and trinamiX™. Such light and portable NIR spectrometers could be used to take hundreds of georeferenced soil scans per day, improving the representation of variable soil landscapes. However, field soil spectroscopy has many challenges that are not common to laboratory soil spectroscopy, where conditions are more controlled. Soil moisture content, variable surface conditions, and environment conditions (for passive instruments) are just a few worth mentioning that have a large impact on the quality of the field spectra, and therefore, must be properly accounted for when integrating or working with dry-scan, laboratory-based soil spectral libraries, like the OSSL [83]. Nonetheless, with appropriate labeling and metadata compilation, the OSSL can accommodate different levels of spectral acquisition (e.g. laboratory and field) and enable adequate interoperability solutions.
In terms of international cooperation and integration, the IEEE AS P4005 working group has been working to define standards and protocols for laboratory, field, and remote sensing soil spectroscopy. FAO’s GLOSOLAN has also been defining the standard operating procedures (SOPs) for four categories of soil properties useful for international projects and extended soil health monitoring [84, 85], the same that were compiled in the OSSL depending on the availability of the original datasets. Likewise, GLOSOLAN has been leading the discussion and the definition of best practices for ensuring quality in the process of generating soil information via analytical wet chemistry. Not only the harmonization of analytical procedures is critical for soil data generation, but also the standardization across instruments and spectroscopy laboratories. A dedicated branch was created for dry chemistry (i.e. soil spectroscopy) in the GLOSOLAN network to promote and establish soil spectroscopy as a clean and sustainable solution for generating soil information across routine and research laboratories. In a parallel effort that was inspired by the previous, SS4GG has recently established a network with more than 20 participants across the world to assess dissimilarities across instruments and laboratories. The first assessment report has been published for the MIR range [22], but we hope that continued research and novel solutions keep coming to ensure better interoperability of soil spectroscopy laboratories and the OSSL.
For the third point on enhancing predictive performance, although the OSSL models showed promising results, several approaches can potentially provide more reliable results on any spectra uploaded to the OSSL. One improvement path would be incorporating ancillary information during model calibration upon the provision of coordinates, date, and depth for new samples. Many approaches can be employed to improve the prediction performance of spectral models as we have indicated that some spectral regions achieved limited prediction performance (Fig 5). Multisensor and data fusion appear promising, as they support exploring complex interactions and variations of the landscape by the incorporation of spatial/temporal context in the models, minimizing in this case, spatial clustering or the overrepresentation of topsoil (Fig 11). One limitation of such an approach, however, is that all the values of the covariates are needed for new samples, and hence this approach is heavily dependent on the availability of up-to-date data and user input of precise spatial and temporal sampling information. This, in fact, led to the simplification of the first OSSL models to rely just on spectral variations and specific ranges, as more sophisticated approaches would create barriers to end-user adoption.
[Figure omitted. See PDF.]
Variables and data sources of interest for producing spatially explicit calibration models. This however requires that all soil samples are fully georeferenced and that covariate layers are available for new prediction locations (including forecasted predictions of weather conditions and similar). Portable NIR scanner in the picture: Neospectra Scanner covering the range from 1350 to 2500 nm.
Today, many detailed auxiliary spatial-temporal data are available, even with global coverage, that could be used to improve the calibration of in situ soil spectral libraries. For example, soil moisture often drastically changes spectral signatures that negatively impact the predictions, so incorporating a detailed estimate of daily soil moisture may help improve the accuracy of spectral models [86–88]. Spectral reflectance information from Earth Observation systems, e.g., VisNIR reflectances from Sentinel-2 or Landsat, could be explored with a special focus on bare/no-vegetation spectral reflectance of the land surface [89, 90]. Nonetheless, EO spectral data may have delays and yield additional costs, as it often takes a significant amount of time to process and prepare cloud-free, bare-earth, or similar high-quality products. Another critical factor about integrating EO reflectance products with field measurements is that, although there are several ground observation points with reference spectral data, the compatibility between the different platforms (field sensor, aerial, orbital) still seems to pose a big barrier and requires further research and development for proper integration.
Model localization is another approach that seems very promising in predictive soil spectroscopy but may pose some restrictions when building estimation services as they often require recursive processing and retraining of model parameters [91, 92]. In addition, local models can produce unreliable results if there is a low similarity between the local and reference library [93], indicating that compilation of large spectral libraries (like the OSSL) is critical for increasing the representation of variable soil types and conditions. Generally applicable models, like those employed in the OSSL, have less of this localization problem but might fail by delivering biased predictions on local samples. Cutting-edge machine learning methods, especially those under the umbrella of deep learning, could be further explored to get the best out of large datasets with multitask learning (prediction of multiple soil properties at the same time), transfer learning between ranges (VisNIR, NIR, and MIR) and domains (laboratory, field and remote), and also allowing the public storage and distribution of pre-trained large models. In fact, both deep learning and localization approaches have been recently demonstrated to work well when combined with the Neospectra NIR for predicting soil organic carbon (25% improvement over the current OSSL framework) [94]. This happened after the promotion of a community data science competition on Kaggle as part of the SS4GG event series, indicating an effective path for community-based model improvement and testing of new ideas with the OSSL.
All things considered, we presented in this paper the OSSL and other open resources built as part of the Soil Spectroscopy for Global Good initiative [20]. The OSSL, a comprehensive and curated database of soil spectra from diverse sources and regions, enabled the development of reproducible soil spectral calibration models that can serve as a reference framework, although there is still room for improvements and adaptation of novel algorithms [24]. We demonstrate the implications of the OSSL by compiling and validating a predictive modeling procedure for several soil properties, and by making all the resources publicly available (data, models, and code). We hope this open data and open science effort contributes to the advancement of soil spectroscopy as a global good. The benefits of open data principles are manifold, but in this case, it may allow for building a collaborative and inclusive community of soil spectroscopy practitioners and stakeholders [18, 19]. We also hope that this effort guides future efforts to improve the usability of soil spectral data and models to drive a better sustainable use and management of soils with fast, affordable, repeatable, and reproducible measurements.
Please, contribute!
OSSL is a genuine open science project with (both) open-source and open-data licenses [18, 19]. We aim to develop and improve calibration models through open science hackathons and workshops. All of our code can be found online under the MIT license; a versioned backup copy of the data is also available via Zenodo under the CC-BY license. Both licenses allow you to extend, build on, and even build commercial businesses on top of this data and code. The data we have created have already been used in some projects [78, 79]. Help us create better models for global good and especially to support nature-based solutions, regenerative agriculture, and environmental monitoring projects. If you are the owner of soil spectral libraries, consider donating your data to the project. Your data will be systematically imported into OSSL and used to update global and local soil spectroscopy calibration models; your attribution and citation requests will be carefully followed.
SS4GG has been encouraging researchers and entities (public and private) to support the project by sharing any SSL. We have been promoting four modes of data sharing: (i) publishing the SSL with an open access license by releasing it under a Creative Commons license (CC-BY, CC-BY-SA) or the Open Data Commons Open Database License (ODbL), so we can directly import into the OSSL with proper attribution; (ii) donating a small portion (e.g. 5–10%) of the dataset under CC-BY, CC-BY-SA and/or ODbL, which can also be directly imported into the OSSL with proper attribution; (iii) allowing the SS4GG initiative to directly access the data so that we can perform data mining and then release only the models under an open data license; and/or (iv) using the OSSL database to produce new derivative products, then sharing them through our own infrastructures or contact us for providing hosting support. In addition, we can sign professional data-sharing agreements with data producers that specify in detail how the data will be used. Our primary interest is in enabling research, sharing, and use of models (calibration and prediction) in collaboration with groups across borders.
Conclusion
We have described a comprehensive and curated global soil spectral library called “Open Soil Spectral Library” (OSSL) [21] which is based on open data principles for building a collaborative and inclusive community of soil spectroscopy practitioners and stakeholders. The OSSL is composed of a suite of datasets, reproducible code, web services, and tutorials. It is one of the main outputs of the Soil Spectroscopy for Global Good (SS4GG) [20] project which aims at bridging gaps between new technology and wider use for soil monitoring. The results of model cross-validation (10–fold with refitting) for MIR and Vis-NIR spectra show that, in general, MIR-based models are more accurate than VisNIR-based models, especially for determining chemical soil properties. From an independent test evaluation, the Cubist comes out as the best-performing algorithm for global calibration and generation of additional reliable outputs, i.e., prediction uncertainty and representation flag. Although many soil properties are well predicted, some properties such as total sulfur, extractable sodium, and electrical conductivity, performed poorly in all spectral regions, with some other extractable nutrients and physical soil properties (coarse fragments, soil water retention parameters, and bulk density) also performing inadequately in one or two spectral regions (VisNIR or Neospectra NIR). Hence, the use of predictive models based solely on spectral variations has limitations. With this genuinely open-source and open-access project, we hope that OSSL becomes a driver of the soil spectroscopy community to accelerate the pace of scientific discovery and innovation.
Supporting information
S1 Table. Summary statistics of soil properties grouped by spectral region imported into the Open Soil Spectral Library (OSSL): Soil property descriptions are supplied in Table 2.
https://doi.org/10.1371/journal.pone.0296545.s001
S2 Table. Goodness-of-fit metrics from 10-fold cross-validation with refitting, calculated across different soil properties and model types: Root mean squared error (RMSE), mean error (bias), coefficient of determination (R2), Lin’s concordance correlation coefficient (CCC), ratio of performance to the interquartile range (RPIQ).
Note: Except for clay.tot_usda.a334_w.pct. silt.tot_usda.c62_w.pct, sand.tot_usda.c60_w.pct, ph.h2o_usda.a268_index, and ph.cacl2_usda.a481_index, all the other soil properties’ metrics are reported in the natural logarithm space (with offset = 1, log1p() R function).
https://doi.org/10.1371/journal.pone.0296545.s002
S3 Table. Goodness-of-fit metrics of organic carbon (oc_usda.c729_w.pct) from an independent evaluation of the mid-infrared (MIR) models calibrated with the Open Soil Spectral Library (OSSL) database.
Statistics are reported for a dataset representing long-term research (LTR) trial sites. Metrics provided in original units after back-transformation.
https://doi.org/10.1371/journal.pone.0296545.s003
S1 Fig. Spectral variability represented by the first two principal components (PCs) of the Visible and Near-Infrared (VisNIR) spectra colored by different original datasets imported into the OSSL.
https://doi.org/10.1371/journal.pone.0296545.s004
(TIF)
S2 Fig. Spectral variability represented by the first two principal components (PCs) of the Mid-Infrared (MIR) spectra colored by different original datasets imported into the OSSL.
https://doi.org/10.1371/journal.pone.0296545.s005
(TIF)
S3 Fig. Spectral variability represented by the first two principal components (PCs) of the Neospectra Near-Infrared (NIR) imported into the OSSL.
https://doi.org/10.1371/journal.pone.0296545.s006
(TIF)
Acknowledgments
We are grateful to the USDA NRCS National Soil Survey Center—Kellogg Soil Survey Laboratory, ICRAF-World Agroforestry, ISRIC-World Soil Information, the Africa Soil Information Service (AfSIS), the European Soil Data Centre, ETH Zurich, and the soil spectroscopy community for publishing and providing high-quality data that the OSSL can build upon. We are also grateful to all contributors to our open-source repositories and materials [21, 24].
References
1. 1. Tóth G, Hermann T, da Silva MR, Montanarella L. Monitoring soil for sustainable development and land degradation neutrality. Environmental Monitoring and Assessment. 2018;190(2). pmid:29302746
* View Article
* PubMed/NCBI
* Google Scholar
2. 2. McBratney A, Field DJ, Koch A. The dimensions of soil security. Geoderma. 2014;213:203–213.
* View Article
* Google Scholar
3. 3. Bradford MA, Carey CJ, Atwood L, Bossio D, Fenichel EP, Gennet S, et al. Soil carbon science for policy and practice. Nature Sustainability. 2019;2(12):1070–1072.
* View Article
* Google Scholar
4. 4. Nature editors. Safeguarding our soils. Nature Communications. 2017;8(1). pmid:29208945
* View Article
* PubMed/NCBI
* Google Scholar
5. 5. Rojas RV, Achouri M, Maroulis J, Caon L. Healthy soils: a prerequisite for sustainable food security. Environmental Earth Sciences. 2016;75(3).
* View Article
* Google Scholar
6. 6. Shepherd KD, Ferguson R, Hoover D, van Egmond F, Sanderman J, Ge Y. A global soil spectral calibration library and estimation service. Soil Security. 2022;7:100061.
* View Article
* Google Scholar
7. 7. Wadoux AMJC, Heuvelink GBM, Lark RM, Lagacherie P, Bouma J, Mulder VL, et al. Ten challenges for the future of pedometrics. Geoderma. 2021;401:115155.
* View Article
* Google Scholar
8. 8. Hendriks CMJ, Stoorvogel JJ, Lutz F, Claessens L. When can legacy soil data be used, and when should new data be collected instead? Geoderma. 2019;348:181–188.
* View Article
* Google Scholar
9. 9. Viscarra-Rossel RA, Behrens T, Ben-Dor E, Chabrillat S, Demattê JAM, Ge Y, et al. Diffuse reflectance spectroscopy for estimating soil properties: A technology for the 21st century. European Journal of Soil Science. 2022;73(4).
* View Article
* Google Scholar
10. 10. Li S, Viscarra Rossel RA, Webster R. The cost-effectiveness of reflectance spectroscopy for estimating soil organic carbon. European Journal of Soil Science. 2021;73(1).
* View Article
* Google Scholar
11. 11. Ben-Dor E, Banin A. Near-infrared analysis as a rapid method to simultaneously evaluate several soil properties. Soil Science Society of America Journal. 1995;59(2):364–372.
* View Article
* Google Scholar
12. 12. Nocita M, Stevens A, van Wesemael B, Aitkenhead M, Bachmann M, Barthès B, et al. Soil spectroscopy: an alternative to wet chemistry for soil monitoring. Advances in agronomy. 2015;132:139–159.
* View Article
* Google Scholar
13. 13. Ben-Dor E, Irons J, Epema G. Soil reflectance. In: Rencz AN, Ryerson RA, editors. Remote Sensing for the Earth Sciences: Manual of Remote Sensing. vol. 3. John Wiley and Sons; 1999. p. 111–188.
14. 14. Viscarra-Rossel R, Behrens T, Ben-Dor E, Brown D, Demattê J, Shepherd KD, et al. A global spectral library to characterize the world’s soil. Earth-Science Reviews. 2016;155:198–230.
* View Article
* Google Scholar
15. 15. Soriano-Disla JM, Janik LJ, Viscarra Rossel RA, Macdonald LM, McLaughlin MJ. The Performance of Visible, Near-, and Mid-Infrared Reflectance Spectroscopy for Prediction of Soil Physical, Chemical, and Biological Properties. Applied Spectroscopy Reviews. 2013;49(2):139–186.
* View Article
* Google Scholar
16. 16. Mastorakis G. Human-like machine learning: limitations and suggestions. CoRR. 2018;abs/1811.06052.
* View Article
* Google Scholar
17. 17. Demattê JA, Paiva AFdS, Poppiel RR, Rosin NA, Ruiz LFC, Mello FAdO, et al. The Brazilian Soil Spectral Service (BraSpecS): A User-Friendly System for Global Soil Spectra Communication. Remote Sensing. 2022;14(3):740.
* View Article
* Google Scholar
18. 18. Burgelman JC, Pascu C, Szkuta K, Von Schomberg R, Karalopoulos A, Repanas K, et al. Open Science, Open Data, and Open Scholarship: European Policies to Make Science Fit for the Twenty-First Century. Frontiers in Big Data. 2019;2. pmid:33693366
* View Article
* PubMed/NCBI
* Google Scholar
19. 19. McKiernan EC, Bourne PE, Brown CT, Buck S, Kenall A, Lin J, et al. How open science helps researchers succeed. eLife. 2016;5. pmid:27387362
* View Article
* PubMed/NCBI
* Google Scholar
20. 20. Soil Spectroscopy for Global Good. Soil Spectroscopy website; 2024. Available from: https://soilspectroscopy.org/.
21. 21. Soil Spectroscopy for Global Good. Open Soil Spectral Library; 2024. Available from: https://soilspectroscopy.github.io/ossl-manual/.
22. 22. Safanelli JL, Sanderman J, Bloom D, Todd-Brown K, Parente LL, Hengl T, et al. An interlaboratory comparison of mid-infrared spectra acquisition: Instruments and procedures matter. Geoderma. 2023;440:116724.
* View Article
* Google Scholar
23. 23. Garrity D, Bindraban P. A globally distributed soil spectral library visible near infrared diffuse reflectance spectra. Nairobi, Kenya: ICRAF (World Agroforestry Centre) / ISRIC (World Soil Information) Spectral Library; 2004. Available from: https://doi.org/10.34725/DVN/MFHA9C.
24. 24. Soil Spectroscopy for Global Good. Soil Spectroscopy GitHub repositories; 2024. Available from: https://github.com/soilspectroscopy.
25. 25. United States Department of Agriculture, National Cooperative Soil Survey. National Cooperative Soil Survey Lab Data Mart (Lab Data); 2023. Available from: https://ncsslabdatamart.sc.egov.usda.gov/querypage.aspx.
26. 26. Seybold CA, Ferguson R, Wysocki D, Bailey S, Anderson J, Nester B, et al. Application of Mid-Infrared Spectroscopy in Soil Survey. Soil Science Society of America Journal. 2019;83(6):1746–1759.
* View Article
* Google Scholar
27. 27. Wijewardane NK, Ge Y, Wills S, Libohova Z. Predicting physical and chemical properties of US soils with a mid-infrared reflectance spectral library. Soil Science Society of America Journal. 2018;82(3):722–731.
* View Article
* Google Scholar
28. 28. Wills S, Loecke T, Sequeira C, Teachman G, Grunwald S, West LT. Overview of the U.S. Rapid Carbon Assessment Project: Sampling Design, Initial Summary and Uncertainty Estimates. In: Hartemink A, McSweeney K, editors. Soil Carbon. Progress in Soil Science. Springer International Publishing; 2014. p. 95–104. Available from: https://doi.org/10.1007/978-3-319-04084-4_10.
29. 29. Wijewardane NK, Ge Y, Wills S, Loecke T. Prediction of Soil Carbon in the Conterminous United States: Visible and Near Infrared Reflectance Spectroscopy Analysis of the Rapid Carbon Assessment Project. Soil Science Society of America Journal. 2016;80(4):973–982.
* View Article
* Google Scholar
30. 30. World Agroforestry (ICRAF), International Soil Reference and Information Centre (ISRIC). ICRAF-ISRIC Soil Spectral Library; 2021. Available from: https://data.worldagroforestry.org/citation?persistentId=doi:10.34725/DVN/MFHA9C.
31. 31. Vagen TG, Winowiecki LA, Desta L, Tondoh EJ, Weullow E, Shepherd K, et al. Mid-Infrared Spectra (MIRS) from ICRAF Soil and Plant Spectroscopy Laboratory: Africa Soil Information Service (AfSIS) Phase I 2009-2013; 2020. Available from: https://doi.org/10.34725/DVN/QXCWP1.
32. 32. Hengl T, Miller MA, Križan J, Shepherd KD, Sila A, Kilibarda M, et al. African soil properties and nutrients mapped at 30 m spatial resolution using two-scale ensemble machine learning. Scientific Reports. 2021;11(1):1–18. pmid:33731749
* View Article
* PubMed/NCBI
* Google Scholar
33. 33. Jones A, Fernandez-Ugalde O, Scarpa S. LUCAS 2015 topsoil survey: presentation of dataset and results. European Commission. Joint Research Centre. Publications Office; 2020. Available from: https://data.europa.eu/doi/10.2760/616084.
34. 34. Summerauer L, Baumann P, Ramirez-Lopez L, Barthel M, Bauters M, Bukombe B, et al. The central African soil spectral library: a new soil infrared repository and a geographical prediction analysis. SOIL. 2021;7(2):693–715.
* View Article
* Google Scholar
35. 35. Garrett LG, Sanderman J, Palmer DJ, Dean F, Patel S, Bridson JH, et al. Mid-infrared spectroscopy for planted forest soil and foliage nutrition predictions, New Zealand case study. Trees, Forests and People. 2022;8:100280.
* View Article
* Google Scholar
36. 36. Schiedung M, Bellè SL, Malhotra A, Abiven S. Organic carbon stocks, quality and prediction in permafrost-affected forest soils in North Canada. CATENA. 2022;213:106194.
* View Article
* Google Scholar
37. 37. Jović B, Ćirić V, Kovačević M, Šeremešić S, Kordić B. Empirical equation for preliminary assessment of soil texture. Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy. 2019;206:506–511. pmid:30176426
* View Article
* PubMed/NCBI
* Google Scholar
38. 38. Sanderman J, Smith C, Safanelli JL, Mitu SM, Ge Y, Murad O, et al. Near-infrared (NIR) soil spectral library using the NeoSpectra Handheld NIR Analyzer by Si-Ware; 2023. Available from: https://zenodo.org/record/7600137.
* View Article
* Google Scholar
39. 39. Francos N, Gholizadeh A, Demattê JAM, Ben-Dor E. Effect of the internal soil standard on the spectral assessment of clay content. Geoderma. 2022;420:115873.
* View Article
* Google Scholar
40. 40. Cools N, Delanote V, Scheldeman X, Quataert P, Vos BD, Roskams P. Quality assurance and quality control in forest soil analyses: a comparison between European soil laboratories. Accreditation and Quality Assurance. 2004;9:688–694.
* View Article
* Google Scholar
41. 41. Pimstein A, Notesco G, Ben-Dor E. Performance of Three Identical Spectrometers in Retrieving Soil Reflectance under Laboratory Conditions. Soil Science Society of America Journal. 2011;75:746–759.
* View Article
* Google Scholar
42. 42. FAO Global Soil Partnership. Standard Operating Procedures (SOPs); 2023. Available from: https://www.fao.org/global-soil-partnership/glosolan-old/soil-analysis/standard-operating-procedures/en/.
43. 43. Soil Survey Staff. Kellogg Soil Survey Laboratory methods manual. Soil Survey Investigations Report No. 42, Version 6.0. U.S. Department of Agriculture, Natural Resources Conservation Service.; 2022. Available from: https://www.nrcs.usda.gov/resources/guides-and-instructions/kssl-guidance.
44. 44. ISRIC World Soil Information. International Soil Standards; 2023. Available from: https://www.isric.org/international-soil-standards.
45. 45. Bispo A, Andersen L, Angers DA, Bernoux M, Brossard M, Cécillon L, et al. Accounting for Carbon Stocks in Soils and Measuring GHGs Emission Fluxes from Soils: Do We Have the Necessary Standards? Frontiers in Environmental Science. 2017;5.
* View Article
* Google Scholar
46. 46. Stevens A, Ramirez-Lopez L. An introduction to the prospectr package; 2022.
* View Article
* Google Scholar
47. 47. Lang M, Binder M, Richter J, Schratz P, Pfisterer F, Coors S, et al. mlr3: A modern object-oriented machine learning framework in R. Journal of Open Source Software. 2019.
* View Article
* Google Scholar
48. 48. Quinlan J. Learning with continuous classes. In: Proc. 5th Australian Joint Conference on Artificial Intelligence, Tasmania, 1992; 1992. p. 343–348.
49. 49. Quinlan J. Combining instance-based and model-based learning. In: Proc. Tenth Int. Conference on Machine Learning; 1993. p. 236–243.
50. 50. Dangal S, Sanderman J, Wills S, Ramirez-Lopez L. Accurate and Precise Prediction of Soil Properties from a Large Mid-Infrared Spectral Library. Soil Systems. 2019;3(1):11.
* View Article
* Google Scholar
51. 51. Minasny B, McBratney AB. Regression rules as a tool for predicting soil properties from infrared reflectance spectroscopy. Chemometrics and Intelligent Laboratory Systems. 2008;94(1):72–79.
* View Article
* Google Scholar
52. 52. Barnes RJ, Dhanoa MS, Lister SJ. Standard Normal Variate Transformation and De-Trending of Near-Infrared Diffuse Reflectance Spectra. Applied Spectroscopy. 1989;43(5):772–777.
* View Article
* Google Scholar
53. 53. Barra I, Haefele SM, Sakrabani R, Kebede F. Soil spectroscopy with the use of chemometrics, machine learning and pre-processing techniques in soil diagnosis: Recent advances—A review. TrAC Trends in Analytical Chemistry. 2021;135:116166.
* View Article
* Google Scholar
54. 54. Vestergaard RJ, Vasava H, Aspinall D, Chen S, Gillespie A, Adamchuk V, et al. Evaluation of Optimized Preprocessing and Modeling Algorithms for Prediction of Soil Properties Using VIS-NIR Spectroscopy. Sensors. 2021;21(20):6745. pmid:34695958
* View Article
* PubMed/NCBI
* Google Scholar
55. 55. Yang L, Shami A. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing. 2020;415:295–316.
* View Article
* Google Scholar
56. 56. de Santana FB, Hall RL, Lowe V, Browne MA, Grunsky EC, Fitzsimons MM, et al. A systematic approach to predicting and mapping soil particle size distribution from unknown samples using large mid-infrared spectral libraries covering large-scale heterogeneous areas. Geoderma. 2023;434:116491.
* View Article
* Google Scholar
57. 57. Chang CW, Laird DA, Mausbach MJ, Hurburgh CR. Near-Infrared Reflectance Spectroscopy-Principal Components Regression Analyses of Soil Properties. Soil Science Society of America Journal. 2001;65(2):480–490.
* View Article
* Google Scholar
58. 58. Jackson JE, Mudholkar GS. Control Procedures for Residuals Associated With Principal Component Analysis. Technometrics. 1979;21(3):341–349.
* View Article
* Google Scholar
59. 59. Laursen K, Rasmussen MA, Bro R. Comprehensive control charting applied to chromatography. Chemometrics and Intelligent Laboratory Systems. 2011;107(1):215–225.
* View Article
* Google Scholar
60. 60. Wadoux AMJC, Malone B, McBratney AB, Fajardo M, Minasny B. Soil Spectral Inference with R: Analysing Digital Soil Spectra Using the R Programming Environment. Progress in Soil Science. Springer International Publishing; 2021.
61. 61. Norinder U, Carlsson L, Boyer S, Eklund M. Introducing Conformal Prediction in Predictive Modeling. A Transparent and Flexible Alternative to Applicability Domain Determination. Journal of Chemical Information and Modeling. 2014;54(6):1596–1603. pmid:24797111
* View Article
* PubMed/NCBI
* Google Scholar
62. 62. Cortés-Ciriano I, van Westen GJP, Bouvier G, Nilges M, Overington JP, Bender A, et al. Improved large-scale prediction of growth inhibition patterns using the NCI60 cancer cell line panel. Bioinformatics. 2015;32(1):85–95. pmid:26351271
* View Article
* PubMed/NCBI
* Google Scholar
63. 63. Geladi P, Kowalski BR. Partial least-squares regression: a tutorial. Analytica Chimica Acta. 1986;185:1–17.
* View Article
* Google Scholar
64. 64. Sanderman J, Savage K, Dangal SRS, Duran G, Rivard C, Cavigelli MA, et al. Can Agricultural Management Induced Changes in Soil Organic Carbon Be Detected Using Mid-Infrared Spectroscopy? Remote Sensing. 2021;13(12):2265.
* View Article
* Google Scholar
65. 65. Soil Spectroscopy for Global Good. Open Soil Spectral Library (OSSL) Explorer application; 2024. Available from: https://explorer.soilspectroscopy.org/.
66. 66. Soil Spectroscopy for Global Good. Open Soil Spectral Library (OSSL) Engine application; 2024. Available from: https://engine.soilspectroscopy.org/.
67. 67. Stoner ER, Baumgardner M. Characteristic variations in reflectance of surface soils. Soil Science Society of America Journal. 1981;45(6):1161–1165.
* View Article
* Google Scholar
68. 68. Margenot AJ, Calderón FJ, Goyne KW, Mukome FND, Parikh SJ. IR Spectroscopy, Soil Analysis Applications. In: Lindon JC, Tranter GE, Koppenaal DW, editors. Encyclopedia of Spectroscopy and Spectrometry. 3rd ed. Academic Press; 2017. p. 448–454. Available from: https://www.sciencedirect.com/science/article/pii/B9780124095472121705.
69. 69. Viscarra-Rossel RA, Walvoort DJJ, McBratney AB, Janik LJ, Skjemstad JO. Visible, near infrared, mid infrared or combined diffuse reflectance spectroscopy for simultaneous assessment of various soil properties. Geoderma. 2006;131(1–2):59–75.
* View Article
* Google Scholar
70. 70. Ng W, Minasny B, Jeon SH, McBratney A. Mid-infrared spectroscopy for accurate measurement of an extensive set of soil properties for assessing soil functions. Soil Security. 2022;6:100043.
* View Article
* Google Scholar
71. 71. De Maesschalck R, Jouan-Rimbaud D, Massart DL. The Mahalanobis distance. Chemometrics and Intelligent Laboratory Systems. 2000;50(1):1–18.
* View Article
* Google Scholar
72. 72. Todd-Brown KEO, Abramoff RZ, Beem-Miller J, Blair HK, Earl S, Frederick KJ, et al. Reviews and syntheses: The promise of big diverse soil data, moving current practices towards future potential. Biogeosciences. 2022;19(14):3505–3522.
* View Article
* Google Scholar
73. 73. GDAL/OGR contributors. GDAL/OGR Geospatial Data Abstraction Software Library; 2023. Available from: https://gdal.org.
74. 74. Telenius A. Biodiversity information goes public: GBIF at your service. Nordic Journal of Botany. 2011;29(3):378–381.
* View Article
* Google Scholar
75. 75. Karlen DL, Rice CW. Soil Degradation: Will Humankind Ever Learn? Sustainability. 2015;7(9):12490–12501.
* View Article
* Google Scholar
76. 76. Gonzalez-Roglich M, Zvoleff A, Noon M, Liniger H, Fleiner R, Harari N, et al. Synergizing global tools to monitor progress towards land degradation neutrality: Trends. Earth and the World Overview of Conservation Approaches and Technologies sustainable land management database. Environmental Science & Policy. 2019;93:34–42.
* View Article
* Google Scholar
77. 77. Comino F, Aranda V, García-Ruiz R, Ayora-Cañada MJ, Domínguez-Vidal A. Infrared spectroscopy as a tool for the assessment of soil biological quality in agricultural soils under contrasting management practices. Ecological Indicators. 2018;87:117–126.
* View Article
* Google Scholar
78. 78. Hong Y, Sanderman J, Hengl T, Chen S, Wang N, Xue J, et al. Potential of globally distributed topsoil mid-infrared spectral library for organic carbon estimation. CATENA. 2024;235:107628.
* View Article
* Google Scholar
79. 79. Zhou Y, Chen S, Hu B, Ji W, Li S, Hong Y, et al. Global soil salinity prediction by open soil Vis-NIR spectral library. Remote Sens (Basel). 2022;14(21):5627.
* View Article
* Google Scholar
80. 80. Hutengs C, Seidel M, Oertel F, Ludwig B, Vohland M. In situ and laboratory soil spectroscopy with portable visible-to-near-infrared and mid-infrared instruments for the assessment of organic carbon in soils. Geoderma. 2019;355:113900.
* View Article
* Google Scholar
81. 81. Priori S, Mzid N, Pascucci S, Pignatti S, Casa R. Performance of a Portable FT-NIR MEMS Spectrometer to Predict Soil Features. Soil Systems. 2022;6(3).
* View Article
* Google Scholar
82. 82. Fonseca AA, Pasquini C, Soares EMB. Large-scale measurement of soil organic carbon using compact near-infrared spectrophotometers: effect of soil sample preparation and the use of local modelling. Environ Sci: Adv. 2023;2:1372–1381.
* View Article
* Google Scholar
83. 83. Stevens A, van Wesemael B, Bartholomeus H, Rosillon D, Tychon B, Ben-Dor E. Laboratory, field and airborne spectroscopy for monitoring organic carbon content in agricultural soils. Geoderma. 2008;144(1–2):395–404.
* View Article
* Google Scholar
84. 84. Horta A, Malone B, Stockmann U, Minasny B, Bishop T, McBratney A, et al. Potential of integrated field spectroscopy and spatial analysis for enhanced assessment of soil contamination: A prospective review. Geoderma. 2015;241:180–209.
* View Article
* Google Scholar
85. 85. Jia X, O’Connor D, Shi Z, Hou D. VIRS based detection in combination with machine learning for mapping soil pollution. Environmental Pollution. 2021;268:115845. pmid:33120345
* View Article
* PubMed/NCBI
* Google Scholar
86. 86. Vergopolan N, Chaney NW, Pan M, Sheffield J, Beck HE, Ferguson CR, et al. SMAP-HydroBlocks, a 30-m satellite-based soil moisture dataset for the conterminous US. Scientific data. 2021;8(1):264. pmid:34635675
* View Article
* PubMed/NCBI
* Google Scholar
87. 87. Karyotis K, Tsakiridis NL, Tziolas N, Samarinas N, Kalopesa E, Chatzimisios P, et al. On-Site Soil Monitoring Using Photonics-Based Sensors and Historical Soil Spectral Libraries. Remote Sensing. 2023;15(6).
* View Article
* Google Scholar
88. 88. Murad M, Jones E, Minasny B, McBratney A, Wijewardane N, Ge Y. Assessing a VisNIR penetrometer system for in-situ estimation of soil organic carbon under variable soil moisture conditions. Biosystems Engineering. 2022;224:197–212.
* View Article
* Google Scholar
89. 89. Rizzo R, Wadoux AMJC, Demattê JAM, Minasny B, Barrón V, Ben-Dor E, et al. Remote sensing of the Earth’s soil color in space and time. Remote Sensing of Environment. 2023;299:113845.
* View Article
* Google Scholar
90. 90. Demattê JA, Safanelli JL, Poppiel RR, Rizzo R, Silvero NEQ, de Sousa Mendes W, et al. Bare earth’s surface spectra as a proxy for soil resource monitoring. Scientific reports. 2020;10(1):1–11. pmid:32157136
* View Article
* PubMed/NCBI
* Google Scholar
91. 91. Luce MS, Ziadi N, Rossel RAV. GLOBAL-LOCAL: A new approach for local predictions of soil organic carbon content using large soil spectral libraries. Geoderma. 2022;425:116048.
* View Article
* Google Scholar
92. 92. Ramirez-Lopez L, Behrens T, Schmidt K, Stevens A, Demattê JAM, Scholten T. The spectrum-based learner: A new local approach for modeling soil vis–NIR spectra of complex datasets. Geoderma. 2013;195–196:268–279.
* View Article
* Google Scholar
93. 93. Ng W, Minasny B, Jones E, McBratney A. To spike or to localize? Strategies to improve the prediction of local soil properties using regional spectral library. Geoderma. 2022;406:115501.
* View Article
* Google Scholar
94. 94. Minarik R, Safanelli JL, Hengl T, Sanderman J. SS4GG hackathon: NIR soil spectroscopy modeling. Kaggle; 2023. Available from: https://www.kaggle.com/competitions/ss4gg-hackathon-nir-neospectra.
* View Article
* Google Scholar
Citation: Safanelli JL, Hengl T, Parente LL, Minarik R, Bloom DE, Todd-Brown K, et al. (2025) Open Soil Spectral Library (OSSL): Building reproducible soil calibration models through open development and community engagement. PLoS ONE 20(1): e0296545. https://doi.org/10.1371/journal.pone.0296545
About the Authors:
José L. Safanelli
Roles: Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing
Affiliation: Woodwell Climate Research Center, Falmouth, MA, United States of America
ORICD: https://orcid.org/0000-0001-5410-5762
Tomislav Hengl
Roles: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Resources, Software, Supervision, Visualization, Writing – original draft, Writing – review & editing
Affiliation: OpenGeoHub foundation, Wageningen, the Netherlands
ORICD: https://orcid.org/0000-0002-9921-5129
Leandro L. Parente
Roles: Data curation, Investigation, Methodology, Resources, Software, Writing – review & editing
Affiliation: OpenGeoHub foundation, Wageningen, the Netherlands
Robert Minarik
Roles: Data curation, Investigation, Methodology, Resources, Software, Validation, Writing – review & editing
Affiliation: OpenGeoHub foundation, Wageningen, the Netherlands
Dellena E. Bloom
Roles: Data curation, Investigation, Methodology, Resources, Software, Writing – review & editing
Affiliation: University of Florida, Gainesville, FL, United States of America
ORICD: https://orcid.org/0000-0002-0598-1747
Katherine Todd-Brown
Roles: Conceptualization, Data curation, Funding acquisition, Investigation, Methodology, Resources, Software, Supervision, Writing – review & editing
Affiliation: University of Florida, Gainesville, FL, United States of America
ORICD: https://orcid.org/0000-0002-3109-8130
Asa Gholizadeh
Roles: Data curation, Investigation, Resources, Writing – review & editing
Affiliation: Czech University of Life Sciences Prague, Prague, Czech Republic
Wanderson de Sousa Mendes
Roles: Data curation, Investigation, Methodology, Software, Writing – review & editing
Affiliation: The Food and Agriculture Organization of the United Nations, Rome, Italy
ORICD: https://orcid.org/0000-0003-1271-031X
Jonathan Sanderman
Roles: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Project administration, Resources, Supervision, Visualization, Writing – review & editing
E-mail: [email protected]
Affiliation: Woodwell Climate Research Center, Falmouth, MA, United States of America
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
[/RAW_REF_TEXT]
1. Tóth G, Hermann T, da Silva MR, Montanarella L. Monitoring soil for sustainable development and land degradation neutrality. Environmental Monitoring and Assessment. 2018;190(2). pmid:29302746
2. McBratney A, Field DJ, Koch A. The dimensions of soil security. Geoderma. 2014;213:203–213.
3. Bradford MA, Carey CJ, Atwood L, Bossio D, Fenichel EP, Gennet S, et al. Soil carbon science for policy and practice. Nature Sustainability. 2019;2(12):1070–1072.
4. Nature editors. Safeguarding our soils. Nature Communications. 2017;8(1). pmid:29208945
5. Rojas RV, Achouri M, Maroulis J, Caon L. Healthy soils: a prerequisite for sustainable food security. Environmental Earth Sciences. 2016;75(3).
6. Shepherd KD, Ferguson R, Hoover D, van Egmond F, Sanderman J, Ge Y. A global soil spectral calibration library and estimation service. Soil Security. 2022;7:100061.
7. Wadoux AMJC, Heuvelink GBM, Lark RM, Lagacherie P, Bouma J, Mulder VL, et al. Ten challenges for the future of pedometrics. Geoderma. 2021;401:115155.
8. Hendriks CMJ, Stoorvogel JJ, Lutz F, Claessens L. When can legacy soil data be used, and when should new data be collected instead? Geoderma. 2019;348:181–188.
9. Viscarra-Rossel RA, Behrens T, Ben-Dor E, Chabrillat S, Demattê JAM, Ge Y, et al. Diffuse reflectance spectroscopy for estimating soil properties: A technology for the 21st century. European Journal of Soil Science. 2022;73(4).
10. Li S, Viscarra Rossel RA, Webster R. The cost-effectiveness of reflectance spectroscopy for estimating soil organic carbon. European Journal of Soil Science. 2021;73(1).
11. Ben-Dor E, Banin A. Near-infrared analysis as a rapid method to simultaneously evaluate several soil properties. Soil Science Society of America Journal. 1995;59(2):364–372.
12. Nocita M, Stevens A, van Wesemael B, Aitkenhead M, Bachmann M, Barthès B, et al. Soil spectroscopy: an alternative to wet chemistry for soil monitoring. Advances in agronomy. 2015;132:139–159.
13. Ben-Dor E, Irons J, Epema G. Soil reflectance. In: Rencz AN, Ryerson RA, editors. Remote Sensing for the Earth Sciences: Manual of Remote Sensing. vol. 3. John Wiley and Sons; 1999. p. 111–188.
14. Viscarra-Rossel R, Behrens T, Ben-Dor E, Brown D, Demattê J, Shepherd KD, et al. A global spectral library to characterize the world’s soil. Earth-Science Reviews. 2016;155:198–230.
15. Soriano-Disla JM, Janik LJ, Viscarra Rossel RA, Macdonald LM, McLaughlin MJ. The Performance of Visible, Near-, and Mid-Infrared Reflectance Spectroscopy for Prediction of Soil Physical, Chemical, and Biological Properties. Applied Spectroscopy Reviews. 2013;49(2):139–186.
16. Mastorakis G. Human-like machine learning: limitations and suggestions. CoRR. 2018;abs/1811.06052.
17. Demattê JA, Paiva AFdS, Poppiel RR, Rosin NA, Ruiz LFC, Mello FAdO, et al. The Brazilian Soil Spectral Service (BraSpecS): A User-Friendly System for Global Soil Spectra Communication. Remote Sensing. 2022;14(3):740.
18. Burgelman JC, Pascu C, Szkuta K, Von Schomberg R, Karalopoulos A, Repanas K, et al. Open Science, Open Data, and Open Scholarship: European Policies to Make Science Fit for the Twenty-First Century. Frontiers in Big Data. 2019;2. pmid:33693366
19. McKiernan EC, Bourne PE, Brown CT, Buck S, Kenall A, Lin J, et al. How open science helps researchers succeed. eLife. 2016;5. pmid:27387362
20. Soil Spectroscopy for Global Good. Soil Spectroscopy website; 2024. Available from: https://soilspectroscopy.org/.
21. Soil Spectroscopy for Global Good. Open Soil Spectral Library; 2024. Available from: https://soilspectroscopy.github.io/ossl-manual/.
22. Safanelli JL, Sanderman J, Bloom D, Todd-Brown K, Parente LL, Hengl T, et al. An interlaboratory comparison of mid-infrared spectra acquisition: Instruments and procedures matter. Geoderma. 2023;440:116724.
23. Garrity D, Bindraban P. A globally distributed soil spectral library visible near infrared diffuse reflectance spectra. Nairobi, Kenya: ICRAF (World Agroforestry Centre) / ISRIC (World Soil Information) Spectral Library; 2004. Available from: https://doi.org/10.34725/DVN/MFHA9C.
24. Soil Spectroscopy for Global Good. Soil Spectroscopy GitHub repositories; 2024. Available from: https://github.com/soilspectroscopy.
25. United States Department of Agriculture, National Cooperative Soil Survey. National Cooperative Soil Survey Lab Data Mart (Lab Data); 2023. Available from: https://ncsslabdatamart.sc.egov.usda.gov/querypage.aspx.
26. Seybold CA, Ferguson R, Wysocki D, Bailey S, Anderson J, Nester B, et al. Application of Mid-Infrared Spectroscopy in Soil Survey. Soil Science Society of America Journal. 2019;83(6):1746–1759.
27. Wijewardane NK, Ge Y, Wills S, Libohova Z. Predicting physical and chemical properties of US soils with a mid-infrared reflectance spectral library. Soil Science Society of America Journal. 2018;82(3):722–731.
28. Wills S, Loecke T, Sequeira C, Teachman G, Grunwald S, West LT. Overview of the U.S. Rapid Carbon Assessment Project: Sampling Design, Initial Summary and Uncertainty Estimates. In: Hartemink A, McSweeney K, editors. Soil Carbon. Progress in Soil Science. Springer International Publishing; 2014. p. 95–104. Available from: https://doi.org/10.1007/978-3-319-04084-4_10.
29. Wijewardane NK, Ge Y, Wills S, Loecke T. Prediction of Soil Carbon in the Conterminous United States: Visible and Near Infrared Reflectance Spectroscopy Analysis of the Rapid Carbon Assessment Project. Soil Science Society of America Journal. 2016;80(4):973–982.
30. World Agroforestry (ICRAF), International Soil Reference and Information Centre (ISRIC). ICRAF-ISRIC Soil Spectral Library; 2021. Available from: https://data.worldagroforestry.org/citation?persistentId=doi:10.34725/DVN/MFHA9C.
31. Vagen TG, Winowiecki LA, Desta L, Tondoh EJ, Weullow E, Shepherd K, et al. Mid-Infrared Spectra (MIRS) from ICRAF Soil and Plant Spectroscopy Laboratory: Africa Soil Information Service (AfSIS) Phase I 2009-2013; 2020. Available from: https://doi.org/10.34725/DVN/QXCWP1.
32. Hengl T, Miller MA, Križan J, Shepherd KD, Sila A, Kilibarda M, et al. African soil properties and nutrients mapped at 30 m spatial resolution using two-scale ensemble machine learning. Scientific Reports. 2021;11(1):1–18. pmid:33731749
33. Jones A, Fernandez-Ugalde O, Scarpa S. LUCAS 2015 topsoil survey: presentation of dataset and results. European Commission. Joint Research Centre. Publications Office; 2020. Available from: https://data.europa.eu/doi/10.2760/616084.
34. Summerauer L, Baumann P, Ramirez-Lopez L, Barthel M, Bauters M, Bukombe B, et al. The central African soil spectral library: a new soil infrared repository and a geographical prediction analysis. SOIL. 2021;7(2):693–715.
35. Garrett LG, Sanderman J, Palmer DJ, Dean F, Patel S, Bridson JH, et al. Mid-infrared spectroscopy for planted forest soil and foliage nutrition predictions, New Zealand case study. Trees, Forests and People. 2022;8:100280.
36. Schiedung M, Bellè SL, Malhotra A, Abiven S. Organic carbon stocks, quality and prediction in permafrost-affected forest soils in North Canada. CATENA. 2022;213:106194.
37. Jović B, Ćirić V, Kovačević M, Šeremešić S, Kordić B. Empirical equation for preliminary assessment of soil texture. Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy. 2019;206:506–511. pmid:30176426
38. Sanderman J, Smith C, Safanelli JL, Mitu SM, Ge Y, Murad O, et al. Near-infrared (NIR) soil spectral library using the NeoSpectra Handheld NIR Analyzer by Si-Ware; 2023. Available from: https://zenodo.org/record/7600137.
39. Francos N, Gholizadeh A, Demattê JAM, Ben-Dor E. Effect of the internal soil standard on the spectral assessment of clay content. Geoderma. 2022;420:115873.
40. Cools N, Delanote V, Scheldeman X, Quataert P, Vos BD, Roskams P. Quality assurance and quality control in forest soil analyses: a comparison between European soil laboratories. Accreditation and Quality Assurance. 2004;9:688–694.
41. Pimstein A, Notesco G, Ben-Dor E. Performance of Three Identical Spectrometers in Retrieving Soil Reflectance under Laboratory Conditions. Soil Science Society of America Journal. 2011;75:746–759.
42. FAO Global Soil Partnership. Standard Operating Procedures (SOPs); 2023. Available from: https://www.fao.org/global-soil-partnership/glosolan-old/soil-analysis/standard-operating-procedures/en/.
43. Soil Survey Staff. Kellogg Soil Survey Laboratory methods manual. Soil Survey Investigations Report No. 42, Version 6.0. U.S. Department of Agriculture, Natural Resources Conservation Service.; 2022. Available from: https://www.nrcs.usda.gov/resources/guides-and-instructions/kssl-guidance.
44. ISRIC World Soil Information. International Soil Standards; 2023. Available from: https://www.isric.org/international-soil-standards.
45. Bispo A, Andersen L, Angers DA, Bernoux M, Brossard M, Cécillon L, et al. Accounting for Carbon Stocks in Soils and Measuring GHGs Emission Fluxes from Soils: Do We Have the Necessary Standards? Frontiers in Environmental Science. 2017;5.
46. Stevens A, Ramirez-Lopez L. An introduction to the prospectr package; 2022.
47. Lang M, Binder M, Richter J, Schratz P, Pfisterer F, Coors S, et al. mlr3: A modern object-oriented machine learning framework in R. Journal of Open Source Software. 2019.
48. Quinlan J. Learning with continuous classes. In: Proc. 5th Australian Joint Conference on Artificial Intelligence, Tasmania, 1992; 1992. p. 343–348.
49. Quinlan J. Combining instance-based and model-based learning. In: Proc. Tenth Int. Conference on Machine Learning; 1993. p. 236–243.
50. Dangal S, Sanderman J, Wills S, Ramirez-Lopez L. Accurate and Precise Prediction of Soil Properties from a Large Mid-Infrared Spectral Library. Soil Systems. 2019;3(1):11.
51. Minasny B, McBratney AB. Regression rules as a tool for predicting soil properties from infrared reflectance spectroscopy. Chemometrics and Intelligent Laboratory Systems. 2008;94(1):72–79.
52. Barnes RJ, Dhanoa MS, Lister SJ. Standard Normal Variate Transformation and De-Trending of Near-Infrared Diffuse Reflectance Spectra. Applied Spectroscopy. 1989;43(5):772–777.
53. Barra I, Haefele SM, Sakrabani R, Kebede F. Soil spectroscopy with the use of chemometrics, machine learning and pre-processing techniques in soil diagnosis: Recent advances—A review. TrAC Trends in Analytical Chemistry. 2021;135:116166.
54. Vestergaard RJ, Vasava H, Aspinall D, Chen S, Gillespie A, Adamchuk V, et al. Evaluation of Optimized Preprocessing and Modeling Algorithms for Prediction of Soil Properties Using VIS-NIR Spectroscopy. Sensors. 2021;21(20):6745. pmid:34695958
55. Yang L, Shami A. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing. 2020;415:295–316.
56. de Santana FB, Hall RL, Lowe V, Browne MA, Grunsky EC, Fitzsimons MM, et al. A systematic approach to predicting and mapping soil particle size distribution from unknown samples using large mid-infrared spectral libraries covering large-scale heterogeneous areas. Geoderma. 2023;434:116491.
57. Chang CW, Laird DA, Mausbach MJ, Hurburgh CR. Near-Infrared Reflectance Spectroscopy-Principal Components Regression Analyses of Soil Properties. Soil Science Society of America Journal. 2001;65(2):480–490.
58. Jackson JE, Mudholkar GS. Control Procedures for Residuals Associated With Principal Component Analysis. Technometrics. 1979;21(3):341–349.
59. Laursen K, Rasmussen MA, Bro R. Comprehensive control charting applied to chromatography. Chemometrics and Intelligent Laboratory Systems. 2011;107(1):215–225.
60. Wadoux AMJC, Malone B, McBratney AB, Fajardo M, Minasny B. Soil Spectral Inference with R: Analysing Digital Soil Spectra Using the R Programming Environment. Progress in Soil Science. Springer International Publishing; 2021.
61. Norinder U, Carlsson L, Boyer S, Eklund M. Introducing Conformal Prediction in Predictive Modeling. A Transparent and Flexible Alternative to Applicability Domain Determination. Journal of Chemical Information and Modeling. 2014;54(6):1596–1603. pmid:24797111
62. Cortés-Ciriano I, van Westen GJP, Bouvier G, Nilges M, Overington JP, Bender A, et al. Improved large-scale prediction of growth inhibition patterns using the NCI60 cancer cell line panel. Bioinformatics. 2015;32(1):85–95. pmid:26351271
63. Geladi P, Kowalski BR. Partial least-squares regression: a tutorial. Analytica Chimica Acta. 1986;185:1–17.
64. Sanderman J, Savage K, Dangal SRS, Duran G, Rivard C, Cavigelli MA, et al. Can Agricultural Management Induced Changes in Soil Organic Carbon Be Detected Using Mid-Infrared Spectroscopy? Remote Sensing. 2021;13(12):2265.
65. Soil Spectroscopy for Global Good. Open Soil Spectral Library (OSSL) Explorer application; 2024. Available from: https://explorer.soilspectroscopy.org/.
66. Soil Spectroscopy for Global Good. Open Soil Spectral Library (OSSL) Engine application; 2024. Available from: https://engine.soilspectroscopy.org/.
67. Stoner ER, Baumgardner M. Characteristic variations in reflectance of surface soils. Soil Science Society of America Journal. 1981;45(6):1161–1165.
68. Margenot AJ, Calderón FJ, Goyne KW, Mukome FND, Parikh SJ. IR Spectroscopy, Soil Analysis Applications. In: Lindon JC, Tranter GE, Koppenaal DW, editors. Encyclopedia of Spectroscopy and Spectrometry. 3rd ed. Academic Press; 2017. p. 448–454. Available from: https://www.sciencedirect.com/science/article/pii/B9780124095472121705.
69. Viscarra-Rossel RA, Walvoort DJJ, McBratney AB, Janik LJ, Skjemstad JO. Visible, near infrared, mid infrared or combined diffuse reflectance spectroscopy for simultaneous assessment of various soil properties. Geoderma. 2006;131(1–2):59–75.
70. Ng W, Minasny B, Jeon SH, McBratney A. Mid-infrared spectroscopy for accurate measurement of an extensive set of soil properties for assessing soil functions. Soil Security. 2022;6:100043.
71. De Maesschalck R, Jouan-Rimbaud D, Massart DL. The Mahalanobis distance. Chemometrics and Intelligent Laboratory Systems. 2000;50(1):1–18.
72. Todd-Brown KEO, Abramoff RZ, Beem-Miller J, Blair HK, Earl S, Frederick KJ, et al. Reviews and syntheses: The promise of big diverse soil data, moving current practices towards future potential. Biogeosciences. 2022;19(14):3505–3522.
73. GDAL/OGR contributors. GDAL/OGR Geospatial Data Abstraction Software Library; 2023. Available from: https://gdal.org.
74. Telenius A. Biodiversity information goes public: GBIF at your service. Nordic Journal of Botany. 2011;29(3):378–381.
75. Karlen DL, Rice CW. Soil Degradation: Will Humankind Ever Learn? Sustainability. 2015;7(9):12490–12501.
76. Gonzalez-Roglich M, Zvoleff A, Noon M, Liniger H, Fleiner R, Harari N, et al. Synergizing global tools to monitor progress towards land degradation neutrality: Trends. Earth and the World Overview of Conservation Approaches and Technologies sustainable land management database. Environmental Science & Policy. 2019;93:34–42.
77. Comino F, Aranda V, García-Ruiz R, Ayora-Cañada MJ, Domínguez-Vidal A. Infrared spectroscopy as a tool for the assessment of soil biological quality in agricultural soils under contrasting management practices. Ecological Indicators. 2018;87:117–126.
78. Hong Y, Sanderman J, Hengl T, Chen S, Wang N, Xue J, et al. Potential of globally distributed topsoil mid-infrared spectral library for organic carbon estimation. CATENA. 2024;235:107628.
79. Zhou Y, Chen S, Hu B, Ji W, Li S, Hong Y, et al. Global soil salinity prediction by open soil Vis-NIR spectral library. Remote Sens (Basel). 2022;14(21):5627.
80. Hutengs C, Seidel M, Oertel F, Ludwig B, Vohland M. In situ and laboratory soil spectroscopy with portable visible-to-near-infrared and mid-infrared instruments for the assessment of organic carbon in soils. Geoderma. 2019;355:113900.
81. Priori S, Mzid N, Pascucci S, Pignatti S, Casa R. Performance of a Portable FT-NIR MEMS Spectrometer to Predict Soil Features. Soil Systems. 2022;6(3).
82. Fonseca AA, Pasquini C, Soares EMB. Large-scale measurement of soil organic carbon using compact near-infrared spectrophotometers: effect of soil sample preparation and the use of local modelling. Environ Sci: Adv. 2023;2:1372–1381.
83. Stevens A, van Wesemael B, Bartholomeus H, Rosillon D, Tychon B, Ben-Dor E. Laboratory, field and airborne spectroscopy for monitoring organic carbon content in agricultural soils. Geoderma. 2008;144(1–2):395–404.
84. Horta A, Malone B, Stockmann U, Minasny B, Bishop T, McBratney A, et al. Potential of integrated field spectroscopy and spatial analysis for enhanced assessment of soil contamination: A prospective review. Geoderma. 2015;241:180–209.
85. Jia X, O’Connor D, Shi Z, Hou D. VIRS based detection in combination with machine learning for mapping soil pollution. Environmental Pollution. 2021;268:115845. pmid:33120345
86. Vergopolan N, Chaney NW, Pan M, Sheffield J, Beck HE, Ferguson CR, et al. SMAP-HydroBlocks, a 30-m satellite-based soil moisture dataset for the conterminous US. Scientific data. 2021;8(1):264. pmid:34635675
87. Karyotis K, Tsakiridis NL, Tziolas N, Samarinas N, Kalopesa E, Chatzimisios P, et al. On-Site Soil Monitoring Using Photonics-Based Sensors and Historical Soil Spectral Libraries. Remote Sensing. 2023;15(6).
88. Murad M, Jones E, Minasny B, McBratney A, Wijewardane N, Ge Y. Assessing a VisNIR penetrometer system for in-situ estimation of soil organic carbon under variable soil moisture conditions. Biosystems Engineering. 2022;224:197–212.
89. Rizzo R, Wadoux AMJC, Demattê JAM, Minasny B, Barrón V, Ben-Dor E, et al. Remote sensing of the Earth’s soil color in space and time. Remote Sensing of Environment. 2023;299:113845.
90. Demattê JA, Safanelli JL, Poppiel RR, Rizzo R, Silvero NEQ, de Sousa Mendes W, et al. Bare earth’s surface spectra as a proxy for soil resource monitoring. Scientific reports. 2020;10(1):1–11. pmid:32157136
91. Luce MS, Ziadi N, Rossel RAV. GLOBAL-LOCAL: A new approach for local predictions of soil organic carbon content using large soil spectral libraries. Geoderma. 2022;425:116048.
92. Ramirez-Lopez L, Behrens T, Schmidt K, Stevens A, Demattê JAM, Scholten T. The spectrum-based learner: A new local approach for modeling soil vis–NIR spectra of complex datasets. Geoderma. 2013;195–196:268–279.
93. Ng W, Minasny B, Jones E, McBratney A. To spike or to localize? Strategies to improve the prediction of local soil properties using regional spectral library. Geoderma. 2022;406:115501.
94. Minarik R, Safanelli JL, Hengl T, Sanderman J. SS4GG hackathon: NIR soil spectroscopy modeling. Kaggle; 2023. Available from: https://www.kaggle.com/competitions/ss4gg-hackathon-nir-neospectra.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
© 2025 Safanelli et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Soil spectroscopy is a widely used method for estimating soil properties that are important to environmental and agricultural monitoring. However, a bottleneck to its more widespread adoption is the need for establishing large reference datasets for training machine learning (ML) models, which are called soil spectral libraries (SSLs). Similarly, the prediction capacity of new samples is also subject to the number and diversity of soil types and conditions represented in the SSLs. To help bridge this gap and enable hundreds of stakeholders to collect more affordable soil data by leveraging a centralized open resource, the Soil Spectroscopy for Global Good initiative has created the Open Soil Spectral Library (OSSL). In this paper, we describe the procedures for collecting and harmonizing several SSLs that are incorporated into the OSSL, followed by exploratory analysis and predictive modeling. The results of 10-fold cross-validation with refitting show that, in general, mid-infrared (MIR)-based models are significantly more accurate than visible and near-infrared (VisNIR) or near-infrared (NIR) models. From independent model evaluation, we found that Cubist comes out as the best-performing ML algorithm for the calibration and delivery of reliable outputs (prediction uncertainty and representation flag). Although many soil properties are well predicted, total sulfur, extractable sodium, and electrical conductivity performed poorly in all spectral regions, with some other extractable nutrients and physical soil properties also performing poorly in one or two spectral regions (VisNIR or NIR). Hence, the use of predictive models based solely on spectral variations has limitations. This study also presents and discusses several other open resources that were developed from the OSSL, aspects of opening data, current limitations, and future development. With this genuinely open science project, we hope that OSSL becomes a driver of the soil spectroscopy community to accelerate the pace of scientific discovery and innovation.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer