1 Introduction
The 2030 Agenda for Sustainable Development requires high-quality monitoring data to gauge progress and support evidence-based policymaking (Allen, 2021). Water, a foundational pillar of sustainable development (WWAP, 2019), is closely linked to numerous targets within the Sustainable Development Goals (SDGs), notably SDG 6 (Sadoff et al., 2020), which aims to ensure the universal availability and sustainable management of water and sanitation, and SDG 14, which focuses on the conservation and sustainable use of oceans, seas, and marine resources. Through its ecological civilization campaign and a series of marine policies (e.g., Maritime Power and Strategy; Chen et al., 2019), China is committed to preserving water resources while advancing resource management methods. To accomplish the United Nations SDGs and align with China's extensive policy frameworks, it is crucial to systematically compile water-related data across inland, coastal, and oceanic domains (Dai et al., 2022; Plagányi et al., 2023). Within the Source-to-Sea (S2S) aquatic continuum, water quality data are pivotal for discerning pollution levels (Regnier et al., 2022). This information plays a critical role in the preservation of water resources and the provision of sanitation services (WWAP, 2023).
Water quality refers to the selected physical, chemical, and biological characteristics of water that determine its suitability for a particular use (World Health Organization, 2017; Johnson et al., 1997). Several key properties are widely used to measure it. Physical characteristics include color, temperature (TEMP), sediment content, turbidity, electrical conductivity, and the concentration of total suspended solids (TSSs) (Oteng-Peprah et al., 2018). Chemical characteristics include the potential of hydrogen (pH), acidity, and indicators of nutrient levels such as ammonia nitrogen (NH4N), nitrite nitrogen (NO2N), nitrate nitrogen (NO3N), and various forms of phosphorus such as dissolved inorganic phosphorus (DIP) and total phosphorus (TP). Oxygen-related indicators are also widely used, including biochemical oxygen demand (BOD) and chemical oxygen demand (COD), which reflect the oxygen required to decompose or oxidize organic matter, and dissolved oxygen (DO) (Hassan Omer, 2020). Biological indicators provide insights into the presence, condition, and abundance of living organisms within water bodies, such as bacteria, algae, and pathogens. Together, these indicators are crucial for assessing water quality and safeguarding the health of aquatic ecosystems and of the human populations that rely on clean water sources.
Sustaining high water quality standards is essential for natural ecosystems, public health, and socioeconomic systems. Contaminants such as excessive nutrients that enter water bodies can harm the integrity, functioning, and biodiversity of both riverine and oceanic ecosystems, which provide a habitat for a diverse array of flora and fauna (Morin and Artigas, 2023). For instance, the influx of pesticides into aquatic systems has been associated with the decline of aquatic species and with perturbations in food chains (Stehle and Schulz, 2015). Adherence to stringent water quality standards is therefore necessary to mitigate these adverse consequences, thereby safeguarding fragile habitats and preserving ecological equilibrium (Hering et al., 2010). Furthermore, clean water is a fundamental safeguard against outbreaks of waterborne diseases (Gleick and Palaniappan, 2010), with direct implications for public health (Prüss-Ustün et al., 2014) and for reducing healthcare expenditures. Diseases such as cholera, typhoid, and hepatitis are directly caused by inadequate water quality (Leju Celestino Ladu et al., 2018). Lastly, impaired water quality can have severe economic consequences, including reduced agricultural productivity, increased costs of water treatment, and damage to tourism industries reliant on pristine water bodies (United Nations, 2018).
Recognition of the significance of water quality to nature, society, food, and security has accelerated the rise and availability of local, national, and global water quality datasets. For example, local water quality datasets include the dataset Water QUAlity, DIscharge and Catchment Attributes, which provides data for 1386 German catchments for studying the species of nitrogen, phosphorus, and organic carbon (Ebeling et al., 2022); a set of water chemistry measurements including carbon species, dissolved nutrients, and major ions to describe the biogeochemical conditions of permafrost-affected Arctic watersheds (Shogren et al., 2022); and a catchment-wide biogeochemical monitoring platform for capturing water temperature, pH, alkalinity, suspended solid, chlorophyll concentrations, and nutrient and cation data of the Thames basin in the United Kingdom to promote drinking water resource management (Bowes et al., 2018). The Water Quality Portal (WQP) comprises thousands of water quality variables encompassing physical conditions, chemical and bacteriological water analyses, chemical analyses of fish tissue, taxon abundance data, toxicity data, habitat assessment scores, and biological index scores, and it has been widely applied in several domains (e.g., to examine water clarity in lakes and reservoirs; Read et al., 2017). By aggregating five large water quality datasets, the Global River Water Quality Archive (GRQA) has significantly expanded both the geographic and historical reach of existing water quality datasets, incorporating 42 parameters related to nutrient species, carbon content, sediment composition, and oxygen levels (Virro et al., 2021).
Despite significant advances in open-data science for water quality research globally, Asia lags far behind other regions in this regard (Virro et al., 2021; Lin et al., 2023a). Water quality data for China, the largest country in East Asia, are notably limited in the comprehensive global dataset, with no data from coastal and oceanic regions. The publicly available data consist of a total of only 3595 daily observations from 244 sites (Table 1), spanning from 1980 to 2009, as documented in GRQA. This is far from adequate for water quality analysis and modeling. Additionally, the water data available from open-data centers are stored in a user-unfriendly format that requires significant additional effort to make them credible, editable, and reusable. For example, monthly water quality data spanning from 2006 to 2022 are presented as reports with figures derived from statistical analysis rather than as the underlying monitoring data. Although some studies have employed national-scale water quality data for assessment and modeling in China (Ma et al., 2020a, b; Huang et al., 2021; Zhang et al., 2022), these datasets are not publicly available due to licensing restrictions and/or government sanctions (Lin et al., 2023a). To date, there is no clean and publicly accessible national water quality dataset that covers the entirety of China.
Therefore, there is a pressing need to reorganize, curate, and manage continuous, long-time-series, standardized, well-organized, and consistent water quality datasets from inland to coastal and oceanic areas within China. These datasets represent invaluable resources to support researchers and decision-makers (Van Vliet et al., 2023). They enable an in-depth examination of water quality status across the entire spectrum from riverine environments to the oceans. Furthermore, they provide the means to model various dimensions of water quality indicators and to forecast the ramifications of emergent water pollution phenomena (e.g., coastal eutrophication and oceanic harmful algal blooms due to additional nitrogen input from land and releases of radionuclides from inland redundant nuclear power plant accidents). They are also valuable for the effective management of water resources to support the United Nations Water Action Decade (2018–2028) and Ocean Decade (2021–2030; Folke et al., 2021). Our water quality dataset was thus initiated to meet the large demand for Chinese water quality data, to boost national water data sharing, and to advance global water-related research and applications. It aims to collect non-sensitive and publicly available water quality data, to apply consistent formatting and curation, and to establish a standardized set of metadata for different water quality aspects.
2 Data and methods
2.1 Openly accessible data sources
The Chinese surface water quality dataset presented herein is derived from three publicly accessible online data sources. Details of these original datasets are provided in Table 1.
Table 1
Source datasets for compiling China water quality dataset.
Name | Data sources | Timestep | Original observations (source/China) | Timeframe | Number of parameters | Number of sites (source/China) |
---|---|---|---|---|---|---|
Global daily water quality data | Global River Water Quality Archive (GRQA) | Daily | 17 000 000/3595 | 1898–2020 | 42 | 93 057/244 |
National weekly water quality data | China National Environmental Monitoring Centre (CNEMC) | Weekly (7 d moving average) | 225 336/225 336 | 2007–2018 | 4 | 150/150 |
National monthly water quality data | National Marine Environmental Monitoring Center (NMEMC) | Monthly | 116 304/116 304 | 2017–2022 | 6 | 1991/1991 |
2.1.1 GRQA
As the most comprehensive water quality dataset, GRQA has incorporated inland water quality data from five existing sources, namely the Canadian Environmental Sustainability Indicators program, Global Freshwater Quality Database, GLObal RIver Chemistry database, European Environment Agency, and USGS WQP, for 42 selected water quality parameters (e.g., nutrients, carbon, oxygen, and sediments; Read et al., 2017; Virro et al., 2021), with 93 057 sites globally spanning from 1898 to 2020 (Table 1).
2.1.2 CNEMC
As the most advanced and complete environmental data center, the China National Environmental Monitoring Centre (CNEMC) is an online information system managed by the agency of the China Ministry of Ecology and Environment. The CNEMC was established in 1979 to monitor all environmental aspects (e.g., quality of air, water, soil), to provide publicly online data, to assess environmental impacts, and to report on the status of water environments for local and national governments. Water quality data available from this center included yearly water quality reports spanning from 2006 to 2022 (
These weekly water quality data were collected and constructed by following the standards from the Environmental Quality Standards for Surface Water (GB3838-2002). Water samples were automatically collected at six intervals throughout the day, with a sampling frequency of one sample every 4 h (00:00–04:00, 04:00–08:00, 08:00–12:00, 12:00–16:00, 16:00–20:00, 20:00–24:00 UTC+8). The weekly water quality dataset was derived through the computation of daily averages encompassing Monday through Sunday. This process yielded a single numerical value that served as a representative of a set of valid data samples. Specifically, a minimum of four data samples were aggregated to calculate the daily average, and five daily average data points were used to compute the weekly average.
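The aggregation rule described above can be illustrated with a minimal Python sketch. It is not the CNEMC procedure itself: the function name and input layout are hypothetical, and reading "five daily average data points" as a minimum threshold is an assumption made here for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

MIN_SAMPLES_PER_DAY = 4  # from the text: at least four valid 4-hourly samples per day
MIN_DAYS_PER_WEEK = 5    # assumption: "five daily averages" is treated as a minimum


def weekly_average(samples):
    """Aggregate (datetime, value-or-None) pairs into Monday-to-Sunday weekly means.

    Returns a dict keyed by (ISO year, ISO week number).
    """
    # Group the valid samples by calendar day.
    by_day = defaultdict(list)
    for timestamp, value in samples:
        if value is not None:
            by_day[timestamp.date()].append(value)

    # Keep only days with enough valid samples and group daily means by ISO week.
    by_week = defaultdict(list)
    for day, values in by_day.items():
        if len(values) >= MIN_SAMPLES_PER_DAY:
            iso = day.isocalendar()
            by_week[(iso[0], iso[1])].append(mean(values))

    # Keep only weeks with enough valid daily means.
    return {
        week: mean(daily_means)
        for week, daily_means in by_week.items()
        if len(daily_means) >= MIN_DAYS_PER_WEEK
    }
```

Weeks or days that do not reach the thresholds simply yield no value in this sketch, which mirrors the idea that only sets of valid samples are represented by an average.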
2.1.3 NMEMC
The National Marine Environmental Monitoring Center (NMEMC), which has been maintained by the China Ministry of Ecology and Environment since 2018, is an agency with a history of 60 years specializing in marine ecological and environmental monitoring and protection. Monthly coastal and oceanic water quality data were accessible via
Guidelines in the Specification for Offshore Environmental Monitoring (HJ 442-2008) directed the methodologies, criteria, and quality assurance measures for monthly sampling of oceanic water quality. Employing Niskin and Go-Flo water samplers, samples were collected multiple times annually, typically during the months of April through December, as illustrated in Fig. 1. The acquisition of this dataset entailed the collection of various quality control samples, including matrix spikes, blanks, parallels, and quality control check samples, which underwent meticulous collection and subsequent intra-laboratory comparison.
Figure 1
Sampling frequency for oceanic water quality.
[Figure omitted. See PDF]
2.2 Procedure for downloading and preprocessing source data
2.2.1 Data capturing
We imported all coordinate data of the GRQA dataset into ArcGIS 10.8 and extracted the sites located in China based on the geopolitical map. Afterwards, metadata information of countries and/or regions from GRQA was tidied and renamed for consistency. For instance, regions identified as “HK”, “Macao”, and “Taiwan” were renamed as “China”. We thus obtained daily water quality data in China from GRQA, which consisted of 244 sites for 15 selected water quality indicators (i.e., BOD, DO, COD, DIP, dissolved oxygen saturation (DOSAT), NH4N, NO2N, NO3N, pH, total dissolved phosphorus (TDP), TEMP, TP, TSSs, dissolved organic carbon (DOC), and total organic carbon (TOC)).
Weekly water quality data were tidied up from the report collection derived from
We have collected the monthly coastal and oceanic water quality data from the NMEMC manually for the years 2017, 2018, 2019, 2020, 2021, and 2022. All data were stored as CSV files and were appended into a single worksheet file, which consisted of 14 columns (i.e., oceanic name (MonitoringLocationDescriptionText), province (ProvinceName), city (CityCode), code of the monitoring site (Source_MonitoringLocationCode), longitude (LongitudeMeasure_WGS84), latitude (LatitudeMeasure_WGS84), monitoring date (MonitoringDate), values of the indicators, water quality index for the current month). The column of the water quality index was removed. Indicators of the coastal and oceanic water quality data included COD, dissolved inorganic nitrogen (DIN), DO, DIP, pH, and TPH.
2.2.2 Coordinates of the monitoring sites
Information of longitude and latitude is fundamental for identifying the location of a monitoring site. This information was used to export spatial point data and was overlapped with other maps to obtain metadata information.
For daily water quality data, the longitude and latitude information was given by the GRQA dataset. The site location for weekly water quality data was coded as plain text of the administrative address, lacking geographic coordinates (i.e., longitude, latitude). We first used a geocoding API to transform each address into a corresponding geographic entity with coordinates. Afterwards, we validated each of them by overlaying them on the layers of watersheds and rivers according to the official maps obtained from the National Geomatics Center of China (
General information for the monthly coastal and oceanic water quality data could be found via the NMEMC. However, the longitude and latitude recorded for the same station or place were sometimes inconsistent. For example, the station with code number FJD10003 was recorded at 120.57° E and 26.84° N in 2021 but at 120.58° E and 26.84° N in 2022. In addition, some stations with the same longitude and latitude had different code numbers. Therefore, we first grouped the records by code number and replaced the initial longitude and latitude of each station with their average values. Subsequently, we removed the column of the code number so that stations sharing the same coordinates under different codes would not be counted twice. Finally, we dropped the duplicated rows to obtain the unique stations.
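The three consolidation steps for the coastal and oceanic stations can be sketched as follows. This is an illustrative Python sketch rather than the code used for the dataset; the function name, the tuple layout, and the rounding to four decimal places are assumptions introduced here.

```python
from collections import defaultdict
from statistics import mean


def consolidate_stations(records):
    """Return the unique station coordinates from (code, lon, lat) records.

    Step 1: average the coordinates reported under each station code.
    Step 2: discard the station code.
    Step 3: drop duplicated coordinate pairs.
    """
    by_code = defaultdict(list)
    for code, lon, lat in records:
        by_code[code].append((lon, lat))

    averaged = set()
    for points in by_code.values():
        lon = round(mean(p[0] for p in points), 4)  # rounding precision is an assumption
        lat = round(mean(p[1] for p in points), 4)
        averaged.add((lon, lat))  # the set removes identical coordinate pairs

    return sorted(averaged)
```

Applied to the FJD10003 example, the two yearly records collapse to a single point at about 120.575° E, 26.84° N, and two hypothetical codes that share identical coordinates collapse to one station.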
All the transferred longitude and latitude information was merged into a single table and then imported into ArcGIS 10.8 as point shapefile in World Geodetic System 1984 (WGS84). After overlapping with the city-level administrative map and watersheds delineation map obtained from the National Geomatics Center of China, we derived other metadata information such as city, sub-watersheds (MonitoringLocationTypeName), etc. We referred to the China Area Code and Zip Code, Version 2021, for the province (ProvinceCode) and city (CityCode).
2.2.3 Data cleaning and technical validation
We undertook a comprehensive standardization process across all the aforementioned data providers. This harmonization encompassed the transformation of downloaded time series into a uniform file format, shifting from CSV files to R time series. Additionally, we ensured consistency in indicator selection, units, data structure, identification of missing values, and language.
Given the limited availability of indicators within the (sub)datasets, all of them were incorporated into our water quality dataset. This inclusive selection comprised both physical parameters (e.g., TEMP, TSSs) and chemical parameters (e.g., pH, BOD, COD, CODMn, DO, DOSAT, DIN, NH4N, NO2N, NO3N, TDP, DIP, TP, TPH, DOC, TOC). We adopted GRQA as a reference for indicator abbreviations, with the aim of facilitating international compatibility when appending to global datasets. It is noteworthy that, except for temperature (°C), pH, and DOSAT (%), the original unit of measurement for all indicators in the (sub)datasets was milligrams per liter (mg L⁻¹), and we retained this unit for consistency. Eight columns (i.e., MonitoringLocationIdentifier, LongitudeMeasure_WGS84, LatitudeMeasure_WGS84, MonitoringDate (with the format %d/%m/%y), IndicatorsName, Value, Unit, SourceProvider) were then included for structuring the full dataset. A column for MonitoringLocationIdentifier was created as an index to connect with the metadata file.
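The eight-column record structure can be pictured with the short Python sketch below. The column names and the date format are taken from the text; the class itself, its helper method, and the example values in the usage note are hypothetical and only illustrate how one observation maps to one row.

```python
from dataclasses import astuple, dataclass, fields
from datetime import date


@dataclass(frozen=True)
class Observation:
    """One row of the long-format dataset (field names follow the text)."""

    MonitoringLocationIdentifier: int  # index linking to the metadata file
    LongitudeMeasure_WGS84: float
    LatitudeMeasure_WGS84: float
    MonitoringDate: date
    IndicatorsName: str
    Value: float
    Unit: str
    SourceProvider: str

    def to_row(self):
        """Return the record as a list with the date formatted as %d/%m/%y."""
        row = list(astuple(self))
        row[3] = self.MonitoringDate.strftime("%d/%m/%y")
        return row


# Column header in the order given in the text.
HEADER = [f.name for f in fields(Observation)]
```

For example, a made-up record `Observation(1, 120.575, 26.84, date(2021, 5, 1), "DO", 7.9, "mg L-1", "NMEMC")` would be written with the date field `01/05/21`.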
When the PDF files of the weekly water quality data were converted to editable files, some observations for different indicators were merged into a single column. These columns were split and tidied into several columns automatically via regular expressions and then validated manually. Three additional columns were added to indicate the specific year (column MonitoringYear), week number (column MonitoringWeek), and monitoring date (column MonitoringDate) for the weekly water quality data. The specific years and week numbers were extracted from the file names. The MonitoringDate for each specific week was estimated using R according to the international standard ISO 8601, in which Monday is considered the first day of a week. The dates were validated against the descriptive text on the cover of each report, which was later deleted from the weekly water quality dataset. The MonitoringDate for ocean water quality data was assumed to be the first day of the corresponding month to keep the date format consistent with the other datasets.
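The date conventions described above can be reproduced with the Python standard library, as in the following sketch. The authors report using R for this step; the Python version and its function names are given here only as an illustrative equivalent under the same ISO 8601 convention.

```python
from datetime import date


def iso_week_monday(year, week):
    """Return the Monday that starts the given ISO 8601 week (Python 3.8+)."""
    return date.fromisocalendar(year, week, 1)


def first_of_month(year, month):
    """Return the date assumed for a monthly observation (first day of the month)."""
    return date(year, month, 1)


def format_date(d):
    """Format a date with the %d/%m/%y pattern used in the dataset."""
    return d.strftime("%d/%m/%y")
```

With this convention, week 44 of 2007 and week 52 of 2018 start on 29 October 2007 and 24 December 2018, respectively, which agrees with the first and last weekly monitoring dates listed in Table 3.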
In addition, duplicated rows were identified and removed by using a distinct function in R based on the unique site, indicators, monitoring week/date, and values from the (sub)datasets; the duplicates included 1776 site pairs from the weekly water quality dataset, caused by the file inconsistencies mentioned in Sect. 2.2.1. Negative values (seven observations) were omitted from the weekly water quality dataset. No duplicated rows or negative values were identified in the monthly water quality datasets. Where seven GRQA sites provided two observations on the same day without specific timestamp information, we replaced these records with the average of the two observations. Missing (e.g., noted as “–”) and empty data were replaced with NA and then omitted from the dataset. Values that fell below known detection limits were denoted as “<DL” in the monthly water quality datasets, which contain 3490 such data points. The detection limits for COD, DO, DIN, DIP, and TPH were 0.15, 0.32, 0.001, 0.001, and 0.001 mg L⁻¹, respectively. The station descriptions that were originally in Chinese were replaced with Hanyu Pinyin.
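The cleaning rules in this section can be summarized in a compact Python sketch. It is a simplification made for illustration: the paper applies each rule to a specific (sub)dataset using R, whereas the hypothetical function below applies all of them to one list of rows, and it flags values below the detection limit with a boolean instead of a text label.

```python
from collections import defaultdict
from statistics import mean

# Detection limits (mg/L) quoted in the text for the monthly data.
DETECTION_LIMITS = {"COD": 0.15, "DO": 0.32, "DIN": 0.001, "DIP": 0.001, "TPH": 0.001}
MISSING_MARKERS = {"", "–", "NA"}  # assumed spellings of missing values


def clean(rows):
    """Clean (site, indicator, date, raw_value) rows, where raw_value is text.

    Returns a list of (site, indicator, date, value, below_detection_limit).
    """
    seen = set()
    grouped = defaultdict(list)
    for site, indicator, day, raw in rows:
        raw = raw.strip()
        if raw in MISSING_MARKERS:
            continue  # missing or empty values are omitted
        key = (site, indicator, day, raw)
        if key in seen:
            continue  # exact duplicate rows are dropped
        seen.add(key)
        value = float(raw)
        if value < 0:
            continue  # negative values are omitted
        grouped[(site, indicator, day)].append(value)

    cleaned = []
    for (site, indicator, day), values in grouped.items():
        value = mean(values)  # several same-day observations are averaged
        limit = DETECTION_LIMITS.get(indicator)
        below = limit is not None and value < limit
        cleaned.append((site, indicator, day, value, below))
    return cleaned
```

Keeping the below-detection-limit information as a separate flag, rather than discarding the value, leaves the choice of substitution rule to the data user.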
2.3 Methods for quality assurance
Because poor data quality introduces bias and uncertainty into results even after imputation (Tiyasha et al., 2020), data quality assurance is a necessary step to identify shortcomings, errors, and issues that could affect research results and to ensure robust use by different data users (Koelmans et al., 2019). In this paper, we used data availability and outliers as quality assurance characteristics.
2.3.1 Availability
Data availability was characterized to assess the available records, both spatially and temporally (Cai and Zhu, 2015). For each time series, we first counted the length of the records (LengthofData) to illustrate the general temporal coverage. Then, we assessed the data intensity, computed as the ratio between the length of the time series and the length of the time series without missing values. Furthermore, we used overall availability, longest availability, and continuity to measure the characteristics of availability following the methods from Crochemore et al. (2019).
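The three availability measures, as defined in Table 2 following Crochemore et al. (2019), can be sketched in Python as below. The sketch assumes that a time series is stored on a regular time grid with `None` marking missing steps and that the length of the dataset's longest period is supplied as a number of time steps; both the representation and the function name are assumptions made for illustration.

```python
def availability_metrics(series, reference_length):
    """Compute overall availability, longest availability, and continuity.

    series: values on a regular time grid, with None for missing steps.
    reference_length: number of time steps in the dataset's longest period.
    """
    observed = sum(value is not None for value in series)

    # Length of the longest run of consecutive non-missing steps.
    longest_run = 0
    run = 0
    for value in series:
        run = run + 1 if value is not None else 0
        longest_run = max(longest_run, run)

    overall = observed / reference_length
    longest = longest_run / reference_length
    continuity = longest / overall if overall else 0.0
    return {
        "OverallAvailability": overall,
        "LongestAvailability": longest,
        "Continuity": continuity,
    }
```

For a series with five observations, a longest gap-free run of three steps, and a reference period of ten steps, this gives an overall availability of 0.5, a longest availability of 0.3, and a continuity of 0.6.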
2.3.2 Outlier detection and treatment
Outliers were detected by using the interquartile range (IQR) method. The IQR is the range between the first (Q1) and third (Q3) quartiles, i.e., IQR = Q3 − Q1. Data points that fell below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR were considered outliers. Since it was difficult to determine whether an outlier was caused by faulty equipment or data entry errors, no observations were omitted from the original datasets.
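A minimal Python sketch of the IQR rule is given below. The quartile estimator is not specified in the text, so the "inclusive" method of `statistics.quantiles` is used here as an assumption; other estimators can shift the fences slightly for small samples.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k * IQR, Q3 + k * IQR].

    Requires at least two data points; the quartile method is an assumption.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v for v in values if v < lower or v > upper]
```

In keeping with the approach described above, the function only reports the flagged values and does not remove them; the share of outliers per indicator follows as `len(iqr_outliers(values)) / len(values)`.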
3 Data records
3.1 General information of metadata
All data were constructed in the form of CSV, while site information was provided with the point shapefile (.shp) map (available for download at 10.6084/m9.figshare.22584742; Lin et al., 2023b). Referring to the inventory information of WQP, descriptions of the metadata for each time series of the water quality dataset are explained in Table 2.
Table 2
Metadata information for water quality data.
Field name | General introduction | Descriptions | Data type |
---|---|---|---|
ID | / | Identifier for each time series | Int |
WaterDataType | Water data type within a broader aspect | “W2” stands for water quality data | String |
MonitoringLocationIdentifier | Identifier for monitoring site | Identifiers for the stations | Int |
MonitoringLocationDescriptionText | Given by the data source | | String |
MonitoringLocationName | Given by the data source | Name of the station | String |
MonitoringLocationType | Indicate the type of monitoring site | River, Lake, Reservoir, Ocean | String |
MonitoringLocationTypeCode | Use code to indicate the type | River (R), Lake (L), Reservoir (V), Ocean (C) | Character |
MonitoringLocationTypeName | Specify the name of the monitoring site | In which rivers, which lakes | String |
Source_MonitoringLocationCode | Location code from the original datasets | | String |
LongitudeMeasure_WGS84 | | | Float |
LatitudeMeasure_WGS84 | | | Float |
ProvinceName | The acronym of a specific province | | String |
ProvinceCode | China area code and zip code | | Int |
CityCode | China area code and zip code | | Int |
IndicatorsName | | | String |
IndicatorsUnit | | | String |
ResolutionCode | Use numbers to identify the temporal resolution | | Int |
ResolutionName | Temporal resolution | | String |
CountryCode | | | Int |
StartDate | | | Date |
EndDate | | | Date |
LengthofData | The count of observations in each time series | | Int |
DataIntensity | Ratio between the length of the time series and the length of the time series without missing values | | Float |
OverallAvailability | Length of the observation series, as a fraction of the dataset longest period | Refers to Crochemore et al. (2019) | Float |
LongestAvailability | Length of the longest observation series without gaps, as a fraction of the dataset longest period | Refers to Crochemore et al. (2019) | Float |
Continuity | Ratio between longest availability and overall availability | Refers to Crochemore et al. (2019) | Float |
SourceProvider | Data source | | String |
SourceProviderID | To separate the type of data source | Classified as authoritative and non-authoritative | String |
Cross-checking showed no spatial overlap among monitoring sites from different data sources (Fig. 2). The dataset contained a large number of monitoring sites for the coastal and oceanic areas obtained from NMEMC (Fig. 2). Most GRQA sites were located on tributaries, while the CNEMC provided most of the sites on the mainstream.
Figure 2
Spatial distribution of water quality monitoring sites from different sources with drainages in China.
[Figure omitted. See PDF]
Our dataset encompassed monitoring site records spanning from 1980 to 2022 (Fig. 3). The number of sites for daily, weekly, and monthly observations was 244, 149, and 1991 respectively. Overall, the number of monitoring sites with records exhibited a slight increase before 2016, followed by a significant surge after 2016. Notably, GRQA predominantly contributes observations from monitoring sites prior to 2006, with an average of 133 observations obtained from approximately 13 sites per year, as illustrated in Fig. 3a and b. By contrast, CNEMC provides data from monitoring sites between 2007 and 2018, averaging around 126 sites per year, while NMEMC covers the period from 2017 to 2022 with an average of approximately 1249 sites per year. Despite CNEMC providing fewer monitoring sites, it consists of a comparable number of observations with an average of approximately 18 145 observations per year compared to NMEMC with an average of 19 159 observations. CNEMC and NMEMC datasets offer a greater number of records in comparison with GRQA. Temporal overlaps between various sources were identified on two occasions. The first instance transpired during the years 2007–2009, involving data from the GRQA and the CNEMC. The second temporal overlap was documented between CNEMC and NMEMC for the years 2017–2018.
Figure 3
Distribution of monitoring sites (a) and observations (b) from different sources over time.
[Figure omitted. See PDF]
3.3 Characteristics of time series
The study identified four distinct types of monitoring site, comprising rivers, lakes, reservoirs, and coast/ocean (Table 3). The majority of the monitoring sites were located along the coast or in the ocean, with 1991 sites, followed by 366 sites in rivers, which covered most of the indicators. Rivers from CNEMC had a considerable number of observations for the CODMn, DO, NH4N, and pH indicators, while the COD, DIN, DIP, DO, pH, and TPH indicators had the most observations in the ocean. Despite having fewer sites and observations for most indicators, rivers had a longer time series period than the other types. The COD, DIP, and TPH indicators included some values that fell below the detection limits.
Table 3
Statistics for different types of the monitoring sites and indicators.
Location type | Sites in total | Indicator number | Indicator name | Sites | Observations | Start date | End date | Below limits | Outliers (%) | Sources |
---|---|---|---|---|---|---|---|---|---|---|
Coast and ocean | 1991 | 6 | COD | 1991 | 19 367 | May 2017 | Aug 2022 | 94 | 4.88 | NMEMC |
 | | | DIN | 1991 | 19 369 | May 2017 | Aug 2022 | / | 8.99 | NMEMC |
 | | | DIP | 1991 | 19 369 | May 2017 | Aug 2022 | 939 | 6.76 | NMEMC |
 | | | DO | 1991 | 18 143 | May 2017 | Aug 2022 | / | 2.78 | NMEMC |
 | | | pH | 1991 | 19 338 | May 2017 | Aug 2022 | / | 3.69 | NMEMC |
 | | | TPH | 1991 | 19 368 | May 2017 | Aug 2022 | 2453 | 2.88 | NMEMC |
River | 366 | 15 | BOD | 10 | 432 | 7 Jan 1980 | 27 Nov 1997 | / | 6.71 | GRQA |
 | | | COD | 10 | 235 | 3 Jan 1988 | 27 Nov 1997 | / | 6.81 | GRQA |
 | | | CODMn | 122 | 45 491 | 29 Oct 2007 | 24 Dec 2018 | / | 4.59 | CNEMC |
 | | | DIP | 3 | 9 | 6 Aug 1981 | 27 Nov 1983 | / | 0.00 | GRQA |
 | | | DO | 135 | 45 932 | 7 Jan 1980 | 24 Dec 2018 | / | 3.99/3.59 | CNEMC (45 459)/GRQA (473) |
 | | | DOC | 5 | 16 | 22 Jul 1981 | 21 May 2008 | / | 0.00 | GRQA |
 | | | DOSAT | 24 | 31 | 14 Jan 1986 | 11 Feb 1999 | / | 3.23 | GRQA |
 | | | NH4N | 123 | 45 567 | 24 Feb 1983 | 24 Dec 2018 | / | 12.28/0.00 | CNEMC (45 562)/GRQA (5) |
 | | | NO2N | 13 | 334 | 6 Aug 1981 | 10 Nov 1997 | / | 7.19 | GRQA |
 | | | NO3N | 119 | 388 | 22 Jul 1981 | 5 Sep 2009 | / | 6.96 | GRQA |
 | | | pH | 251 | 46 181 | 21 Jan 1980 | 24 Dec 2018 | / | 0.50/0.99 | CNEMC (45 571)/GRQA (610) |
 | | | TDP | 3 | 16 | 12 Apr 1994 | 21 Oct 1996 | / | 0.00 | GRQA |
 | | | TEMP | 92 | 520 | 6 Feb 1980 | 5 Apr 2009 | / | 0.00 | GRQA |
 | | | TOC | 1 | 1 | 30 Aug 1994 | 30 Aug 1994 | / | 0.00 | GRQA |
 | | | TP | 10 | 196 | 7 Jan 1985 | 17 Oct 1996 | / | 15.31 | GRQA |
 | | | TSSs | 12 | 329 | 8 Jan 1980 | 22 Sep 1997 | / | 9.73 | GRQA |
Lake | 22 | 4 | CODMn | 22 | 6657 | 29 Oct 2007 | 24 Dec 2018 | / | 10.64 | CNEMC |
 | | | DO | 22 | 6656 | 29 Oct 2007 | 24 Dec 2018 | / | 2.48 | CNEMC |
 | | | NH4N | 22 | 6667 | 29 Oct 2007 | 24 Dec 2018 | / | 6.90 | CNEMC |
 | | | pH | 22 | 6661 | 29 Oct 2007 | 24 Dec 2018 | / | 0.05 | CNEMC |
Reservoir | 5 | 4 | CODMn | 5 | 2231 | 29 Oct 2007 | 24 Dec 2018 | / | 8.70 | CNEMC |
 | | | DO | 5 | 2276 | 29 Oct 2007 | 24 Dec 2018 | / | 1.36 | CNEMC |
 | | | NH4N | 5 | 2268 | 29 Oct 2007 | 24 Dec 2018 | / | 11.02 | CNEMC |
 | | | pH | 5 | 2252 | 29 Oct 2007 | 24 Dec 2018 | / | 0.27 | CNEMC |
Availability (Fig. 4a) and continuity (Fig. 4b) plots were used to examine the temporal fragmentation of the time series. The dominant indicators (i.e., CODMn, DO, NH4N, pH) were selected for presentation in Fig. 4. Our analysis revealed that inland observations exhibited considerably higher availability and continuity than those from ocean areas. Specifically, for weekly water quality data, data availability for all indicators ranged from 40 % to 80 % (Fig. 4a), indicating good data availability. By contrast, observations from the ocean showed moderate availability and low continuity for most time series.
Figure 4
Overall availability (a) and continuity (b) for KMnO4 chemical oxygen demand (CODMn), dissolved oxygen (DO), ammonia nitrogen (NH4N), and pH.
[Figure omitted. See PDF]
The proportions of outliers are documented in Table 3. Among all indicators, TP and NH4N exhibited the highest proportions of outliers. After the outliers detected through the IQR test were set aside, boxplots were constructed for each indicator, illustrating a prominent positive skew in their distributions (Fig. 5). This skewness was consistent with the characteristics observed in the GRQA dataset. By contrast, DO and pH were approximately normally distributed across all three data sources. For the TOC indicator, a boxplot was not informative because only a single data point was available (Table 3), and it was therefore omitted.
Figure 5
Boxplots for all indicators with (a) biochemical oxygen demand (BOD), (b) chemical oxygen demand (COD), (c) KMnO4 chemical oxygen demand (CODMn), (d) dissolved inorganic nitrogen (DIN), (e) dissolved inorganic phosphorus (DIP), (f) dissolved oxygen (DO), (g) dissolved organic carbon (DOC), (h) dissolved oxygen saturation (DOSAT), (i) ammonia nitrogen (NH4N), (j) nitrite nitrogen (NO2N), (k) nitrate nitrogen (NO3N), (l) potential of hydrogen (pH), (m) total dissolved phosphorus (TDP), (n) temperature (TEMP), (o) total phosphorus (TP), (p) total petroleum hydrocarbons (TPH), and (q) total suspended solids (TSSs). Outliers determined by the interquartile range (IQR) have been removed. All indicators are in mg L⁻¹ except TEMP (°C), pH (unitless), and DOSAT (%).
[Figure omitted. See PDF]
4 Applications
Given the extensive metadata included in our inventory and the volume of observations, this database will be particularly useful for researchers and decision-makers in hydrology, environmental research, water resources management, ecology, climate change, policy development, public health, and oceanography. For example, hydrologists can use the NH₃N indicator to develop predictive models, calibrate nitrogen models, and generate projections within China. The inland, coastal, and oceanic water quality data can be linked to show the dynamics of water quality from land to ocean, thereby tracing the import, transport, and export of pollutants. Researchers can use these data to analyze long-term trends and variations in surface water quality, which is vital for understanding the impacts of factors such as climate change, pollution, and land use on aquatic ecosystems. Water resource managers can use this repository to assess water quality in different regions and to make informed decisions about water allocation, treatment, and conservation strategies. Policymakers can rely on it to support evidence-based policy development on water quality standards and regulations. Health officials can use these data to monitor the safety of water sources and assess potential health risks associated with waterborne contaminants. The dense coastal and oceanic water quality data can also be used to characterize the environmental conditions that underpin the marine food web (e.g., the living conditions of plankton). For instance, phytoplankton and zooplankton communities are sensitive to changes in water quality and respond to low DO levels, high nutrient levels (e.g., DIN), and toxic contaminants (e.g., TPH). Therefore, this spatially continuous coastal and oceanic water quality dataset is helpful for characterizing the spatiotemporal distribution of plankton, assessing the status and trends of biodiversity, and predicting population succession in a changing ocean.
Some studies have already used parts of the original datasets. For instance, researchers have employed the weekly water quality data to examine the characteristics, trends, and seasonality of water quality in the Yangtze River (Di et al., 2019; Duan et al., 2018). However, the complete dataset presented in this study has not yet been used in any research, so its reliability as a whole remains to be independently verified. In the future, we plan to use this dataset in our own research projects, where we will rigorously test its reliability.
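A trend and seasonality analysis of the kind mentioned above could start from something like the following Python sketch. It is not the method of the cited studies: it simply computes a mean value per calendar month (a basic climatology) and an ordinary least-squares slope per year. The function names and the four dissolved oxygen values are hypothetical and are not drawn from the dataset.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_climatology(records):
    """Mean value per calendar month from (date, value) pairs."""
    by_month = defaultdict(list)
    for d, v in records:
        by_month[d.month].append(v)
    return {m: mean(vs) for m, vs in sorted(by_month.items())}


def linear_trend_per_year(records):
    """Ordinary least-squares slope of value against time, in value units per year."""
    t0 = min(d for d, _ in records)
    xs = [(d - t0).days / 365.25 for d, _ in records]
    ys = [v for _, v in records]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical DO values (mg/L): high in winter, low in summer, declining overall.
do_series = [
    (date(2019, 1, 1), 10.0),
    (date(2019, 7, 1), 6.0),
    (date(2020, 1, 1), 9.0),
    (date(2020, 7, 1), 5.0),
]
clim = monthly_climatology(do_series)
slope = linear_trend_per_year(do_series)
print(clim, slope)
```

For real series, a rank-based trend test that accounts for seasonality would usually be more robust than a plain least-squares slope, particularly given the skewed distributions and data gaps described in the preceding sections.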
5 Data availability
All data records can be found in the figshare repository at https://doi.org/10.6084/m9.figshare.22584742 (Lin et al., 2023b).
6 Conclusions
This water quality dataset was developed to address the substantial demand for Chinese water quality data, to strengthen national water data sharing initiatives, and to foster advances in global water-related research and applications. It provides a clean, editable, and sharable national water quality dataset for China, compiled from three publicly available (sub)datasets: GRQA, CNEMC, and NMEMC. The current dataset includes water quality data at 2384 sites (daily records at 244 sites, weekly at 149 sites, and monthly at 1991 sites) for the period 1980–2022, with over 330 000 observations of 18 indicators across inland, coastal, and oceanic domains. Approximately 98.9 % of the observations originate from CNEMC and NMEMC, which significantly expands the global water quality dataset, with a notable emphasis on the Asian region.
This database will be particularly useful for researchers and decision-makers in hydrology, environmental management, and oceanography for advancing the assessment, modeling, and projection of water quality, ocean biomass, and biodiversity in China. Owing to its extensive coverage of oceanic monitoring sites, the dataset makes a substantial contribution to the dissemination of coastal and oceanic water quality data, offers a comprehensive depiction of the aquatic environment, and helps researchers conduct in-depth investigations of ocean ecosystems. Owing to its comprehensive temporal coverage of riverine water quality, the dataset is also a valuable resource for research that requires large-sample datasets and continuous information, especially watershed modeling tasks such as the modeling and projection of water pollutants.
This water quality dataset will be regularly updated to incorporate newly released public government data in China so that they are promptly available to the community. Because biological parameters are currently absent from the global water quality dataset, we intend to incorporate relevant biological parameters as new government data are released. This dataset also introduces the metadata framework for forthcoming national datasets: a comprehensive collection of water-related data throughout China that aims to provide free, clean, non-sensitive, coherent, and reliable water data to global researchers, to support national water resources management, and to further promote Asian water data sharing.
Author contributions
YC and ZY were involved in planning and supervised the work. JL, PW, and JW designed the code. JL carried out the data processing with contributions from PW and JW. JL mapped the monitoring sites and developed the outlier detection strategy. YZ helped improve the grammar and flow of the manuscript. JL prepared the manuscript and JW, YZ, XZ, PY, and HZ provided critical feedback and helped shape the research, analysis, and manuscript.
Competing interests
The contact author has declared that none of the authors has any competing interests.
Disclaimer
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.
Financial support
This work was supported by the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (grant no. 2019ZT08L213) and the National Natural Science Foundation of China (grant nos. U20A20117, 52200213, and 52239005).
Review statement
This paper was edited by Yue Qin and reviewed by four anonymous referees.
© 2024. This work is published under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Water quality data are a critical resource for evaluating the well-being of aquatic ecosystems and for ensuring clean water sources for human populations. While the availability of water quality datasets is growing, a publicly accessible national water quality dataset covering both inland and oceanic waters in China has been notably absent. To address this gap, we used the R and Python programming languages to collect, tidy, reorganize, curate, and compile three publicly available datasets, thereby creating an extensive spatiotemporal repository of surface water quality data for China. As the most expansive, clean, and easily accessible water quality dataset in China to date, this repository comprises over 330 000 observations, encompassing daily (3588), weekly (217 751), and monthly (114 954) records of surface water quality for the period from 1980 to 2022. It spans 18 distinct indicators measured at 2384 monitoring sites, categorized as daily (244 sites), weekly (149 sites), and monthly (1991 sites), ranging from inland locations to coastal and oceanic areas. This dataset will support studies relevant to the assessment, modeling, and projection of water quality, ocean biomass, and biodiversity in China, and will therefore make substantial contributions to both national and global water resources management.
This water quality dataset and supplementary metadata are available for download from the figshare repository at https://doi.org/10.6084/m9.figshare.22584742 (Lin et al., 2023b).

1 Guangdong Provincial Key Laboratory of Water Quality Improvement and Ecological Restoration for Watersheds, School of Ecology, Environment and Resources, Guangdong University of Technology, Guangzhou 510006, China; Southern Marine Science and Engineering Guangdong Laboratory (Guangzhou), Guangzhou 511458, China
2 Coastal and Ocean Management Institute, College of the Environment and Ecology, Xiamen University, Xiamen 361102, China
3 School of Life and Environmental Sciences, Deakin University, Burwood, Vic 3125, Australia
4 Department of Ocean Science and Engineering, Southern University of Science and Technology (SUSTech), Shenzhen 518055, China
5 Global Hydrological Prediction Center, Institute of Industrial Science, The University of Tokyo, Tokyo 153-8505, Japan
6 CAS Key Laboratory of Tropical Marine Bio-Resources and Ecology, South China Sea Institute of Oceanology, Chinese Academy of Sciences, Guangzhou 510301, China