Estimation of Chlorophyll-a Concentration in Lake Taihu from Gaofen-1 Wide-Field-of-View Data through a Machine Learning Trained Algorithm

2022-03-12 07:45
Journal of Meteorological Research, 2022, Issue 1

Xin HANG, Yachun LI*, Xinyi LI, Meng XU, and Liangxiao SUN

1 Jiangsu Climate Center, Jiangsu Meteorological Bureau, Nanjing 210008

2 Nanjing Joint Institute for Atmospheric Sciences, Nanjing 210009

ABSTRACT

Key words: chlorophyll-a concentration, Gaofen-1 (GF-1), wide-field-of-view, random forest algorithm, Lake Taihu

1. Introduction

Lake eutrophication and harmful algal blooms are common ecological problems of water ecosystems globally (Dai et al., 2016; Zhu et al., 2017). Because chlorophyll-a (Chl-a) is the most abundant pigment in phytoplankton and algae, accurate observation of its concentration is of great importance in both assessing the degree of eutrophication of water bodies and promoting water environment management and ecological protection (Harvey et al., 2015).

Lake Taihu, the third largest freshwater lake in China, is a typical large shallow lake. With ongoing climate warming and eutrophication, the frequency, intensity, and annual duration of cyanobacterial blooms have been increasing for decades. Therefore, there is an urgent need to monitor and manage water quality in Lake Taihu effectively, and to gain a broader understanding of the optical, biological, and ecological processes and phenomena in all fresh inland waters.

Although in-situ measurements can provide details on inland lake water quality, they have limitations, in particular a lack of spatial coverage. With much greater temporal and spatial coverage, various satellite sensors have been widely used to estimate Chl-a in large water bodies (He et al., 2020). In principle, optical radiometric measurements of clean oceanic waters allow accurate monitoring of ocean color, which is governed primarily by Chl-a and its accessory pigments. Remotely sensed aquatic signals recorded at the top-of-atmosphere (TOA) are bulk optical properties emanating from the absorption and scattering of solar photons in the atmosphere, within the water column, and at the air–water interface. After the atmospheric effects are removed, the TOA signal is reduced to the remote sensing reflectance, from which in-water optical properties are quantified (Wang et al., 2013; Pahlevan et al., 2020). Therefore, Chl-a is usually obtained in two stages: (1) application of an atmospheric correction (AC) algorithm to generate normalized water-leaving radiance spectra, and (2) derivation of Chl-a from the obtained water-leaving radiance spectra (Gordon and Wang, 1994a; Wang et al., 2012). However, the standard Chl-a retrieval algorithms originally developed for open ocean waters tend to fail when applied to more turbid inland and coastal waters, whose optical properties are strongly influenced by non-covarying concentrations of non-algal particles and colored dissolved organic matter (IOCCG, 2006; Matthews, 2011; Palmer et al., 2015). In addition, the continentality of the atmosphere overlying inland and coastal waters and the proximity of the adjacent land surface mean that standard approaches to AC over ocean waters are not always reliable (Mouw et al., 2015; Palmer et al., 2015). There is a clear need to develop and validate atmospheric and in-water algorithms specifically for turbid inland and coastal waters.

Many efforts have been made in recent years to retrieve Chl-a from coastal and turbid inland waters. Multiple algorithms have been developed to retrieve Chl-a from multispectral and hyperspectral images over coastal and inland waters (Xu et al., 2019), including empirical models (Zhou et al., 2011; Qi et al., 2014), semi-analytical algorithms (Gordon et al., 1988; Li et al., 2006), and machine learning (ML) models (Jiang et al., 2013; Zhu et al., 2017; He et al., 2020). Among these, traditional empirical and semi-analytical models have difficulty estimating Chl-a in turbid and severely eutrophic lakes, because they rely on blue–green wavelengths, where water constituents other than phytoplankton often dominate the optical properties and where the satellite-derived water-leaving radiance often contains substantial uncertainties (Palmer et al., 2015; Shi et al., 2017). ML, a subset of artificial intelligence, can mine data features in depth through nonlinear and complex calculations, and has shown superior performance (Lary et al., 2016). Multiple ML algorithms have been used to estimate Chl-a, such as the neural network (Zhu et al., 2017), support vector machine (Zhang et al., 2009; Kong et al., 2017), and random forest (RF; Zhang et al., 2018). The RF is an ensemble ML algorithm (Zhang et al., 2020; Xu et al., 2021) that provides a nonparametric, nonlinear, and multivariate regression analysis with high performance when estimating and predicting Chl-a (Yajima and Derot, 2018; Zhang et al., 2018).

Previous studies have generally used an AC algorithm to remove the atmospheric and surface effects from satellite observations in order to retrieve the water-leaving radiance, which can be used to produce water color products such as the Chl-a concentration. However, removing a large atmospheric signal and accurately deriving the very small signal from the water is the major challenge in turbid inland lake conditions (Wang, 2007; IOCCG, 2010). Conventional and modified AC algorithms (Gordon and Wang, 1994b; Shi and Wang, 2007) may produce doubtful results in turbid inland lakes, mainly because of the difficulty in deriving normalized water-leaving radiance spectra (Jamet et al., 2011; Goyens et al., 2013; Singh and Shanmugam, 2014; Fan et al., 2017).

Therefore, we introduce an RF-based approach for Chl-a retrieval over Lake Taihu. This alternative approach does not attempt to remove atmospheric effects, but extracts the Chl-a information directly from the TOA reflectance observed by satellites. It may thus avoid the need to retrieve the water-leaving radiance through an AC, which is typically prone to large errors in turbid and high-biomass waters. In addition, Gaofen-1 (GF-1) is the first high-resolution earth observation satellite of China and is equipped with four 16-m-resolution multispectral wide-field-of-view (WFV) sensors. The high resolution of GF-1 is a considerable advantage for the inversion of water quality parameters. However, there are currently few studies on the use of the GF-1 satellite for Chl-a retrieval in inland lakes (Zhu et al., 2017). The primary objective of this study is to develop a model for Chl-a retrieval at the high spatial resolution of 16 m from GF-1 WFV data. A fine-resolution and high-frequency Chl-a concentration product would benefit small-scale water environmental and ecological studies such as those in Lake Taihu. The remainder of the manuscript is structured as follows. In Section 2, the satellite data and in-situ measurements used in the study are presented. In Section 3, the ML process adopted for Chl-a concentration retrieval is described. The retrieval results are then described in Section 4. A discussion is presented in Section 5, followed by the conclusions in Section 6.

2. Data

2.1 Study region

Lake Taihu is one of the largest freshwater lakes in China. It is located in the southeast portion of the Yangtze River Delta (30°55′–31°30′N, 119°55′–120°40′E), covering an area of 2338 km² with an average water depth of 2 m. Following the rapid development of China’s economy in the previous 30 years, increasing volumes of industrial, agricultural, and domestic sewage from surrounding cities have been discharged continually into Lake Taihu. This process has caused the water to become eutrophic, resulting in frequent occurrence of large-scale cyanobacterial blooms (Li et al., 2016; Shi et al., 2017).

2.2 In-situ Chl-a concentration

The Department of Ecology and Environment of Jiangsu Province established 19 water quality buoy sites in Lake Taihu (as shown in Fig. 1) to collect water quality data automatically. Although few in number, these sites are distributed fairly uniformly throughout the lake, and the observations can represent the water quality distribution in the lake. All the buoy sites are equipped with a multiparameter water quality monitoring instrument (model: YSI6600). The observational data include 14 water quality parameters such as Chl-a concentration, algae density, and dissolved oxygen concentration. The instrument is calibrated once a week, and a comparison experiment with actual water samples is carried out at the same time to ensure the stability and reliability of the data. The Chl-a values selected in this study are instantaneous values observed at the overpass time (1130 BT) of the GF-1 satellite.

2.3 Satellite data

The GF-1 satellite, which is the first high-resolution satellite developed by China, was launched on 26 April 2013. This satellite is equipped with a 2-m-resolution panchromatic sensor, an 8-m-resolution multispectral sensor, and four 16-m-resolution multispectral WFV sensors. This study uses the data acquired by the 16-m-resolution WFV sensors. Table 1 lists the basic parameters of the WFV sensors onboard GF-1.

The GF-1 WFV data were acquired from January 2018 to May 2019 and include the TOA reflectance at four bands. The 27 GF-1 images selected for model training and cross-validation cover 18 days (Table 2). Furthermore, the TOA reflectances observed by GF-1 on 4 June and 13 December 2019 are also selected as model input data for Chl-a retrieval. These data are used only for independent testing of the performance of the models, that is, not for training or cross-validation purposes. All images underwent orthorectification, radiometric calibration, and image mosaicking.

Fig. 1. Distribution of the 19 water quality buoy sites in Lake Taihu, China. All buoy sites are equipped with a multiparameter water quality monitoring instrument (model: YSI6600) and provide in-situ measurements of Chl-a concentration required for model training and validation.

Table 1. Parameters of the GF-1 WFV sensors

Table 2. GF-1 images selected for model training and cross-validation

For comparison, Earth Observing System (EOS) Moderate Resolution Imaging Spectroradiometer (MODIS) data are also collected for retrieving cyanobacterial blooms. The commonly used normalized difference vegetation index (NDVI) is used to extract cyanobacterial bloom information from all available MODIS data collected during 2007–2018. In this paper, cyanobacterial blooms are characterized by two indicators, the bloom area and the number of occurrences: the area is determined by applying an NDVI threshold, and the number of occurrences is determined based on the area (Li et al., 2016).
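As a minimal sketch of the area indicator described above, the following Python snippet counts the pixels whose NDVI exceeds a threshold and converts the count to an area. The threshold of 0.0, the pixel area of 0.0625 km² (which assumes 250-m pixels), and the function names are illustrative assumptions; the source does not state the values used.

```python
def ndvi(red, nir):
    """Normalized difference vegetation index for one pixel."""
    total = nir + red
    return (nir - red) / total if total != 0 else 0.0


def bloom_area_km2(red_band, nir_band, threshold=0.0, pixel_km2=0.0625):
    """Total area of pixels whose NDVI exceeds the threshold.

    threshold and pixel_km2 are assumed values, not taken from the paper.
    """
    count = 0
    for red_row, nir_row in zip(red_band, nir_band):
        for red, nir in zip(red_row, nir_row):
            if ndvi(red, nir) > threshold:
                count += 1
    return count * pixel_km2


# Toy 2 x 2 reflectance grids (hypothetical values).
red = [[0.05, 0.04], [0.06, 0.03]]
nir = [[0.02, 0.10], [0.03, 0.09]]
area = bloom_area_km2(red, nir)  # two pixels exceed the threshold
```

In this toy grid, two of the four pixels have a positive NDVI, so the bloom area is two pixel areas.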

2.4 Sample preparation and analysis of latent variables

From all the Chl-a concentrations measured at the time of overflight of the GF-1 satellite on the 18 days, and after excluding 152 missing and abnormal data (mainly owing to cloud contamination), 190 matching data elements are obtained to form the Chl-a sample dataset for model training and cross-validation. Grouped by season, 57 spring (March–May), 30 summer (June–August), 20 autumn (September–November), and 83 winter (December–February) sample subsets of measured Chl-a are obtained. Additionally, the Chl-a measured on 4 June and 13 December 2019 is used to test the performance of the models independently. Figure 2 shows the frequency distributions of the Chl-a sample dataset. It can be seen that the Chl-a concentration is mainly concentrated between 4 and 10 mg m−3, of which the winter dataset accounts for the highest proportion of 95.2%, followed by the whole year dataset with 88.4%, and the summer dataset with the lowest proportion of 73.4%. The maximum Chl-a concentration is 37.4 mg m−3, which appears in autumn, and the minimum is 0.9 mg m−3, which appears in spring. Among the seasonal mean Chl-a concentrations, the maximum is 9.6 mg m−3 in summer, followed by autumn (8.0 mg m−3), and the minimum is in winter (6.6 mg m−3). The distribution range of the Chl-a sample dataset used for modeling is reasonable, and the seasonal concentration changes are consistent with the actual situation, so the dataset can be used for model training and cross-validation.

The GF-1 WFV TOA reflectance is used as the input for the RF model. It is considered reasonable to use the reflectance in the four GF-1 WFV bands for water quality parameter retrieval because the reflectance of a single band is a complex function of many parameters (Kong et al., 2017). Following Fang et al. (2019), 39 variables are constructed from the 4 single bands, the main vegetation indices commonly used to retrieve Chl-a, and some other band combinations of the GF-1 WFV, and are selected as the latent variables for model filtering (Table 3).
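To make the idea of band-combination variables concrete, the sketch below builds single-band, ratio (RVI), difference (DVI), and normalized-difference (NDVI) variables from four TOA reflectances. The generalized two-band forms RVI(i,j) = Bi/Bj, DVI(i,j) = Bi − Bj, and NDVI(i,j) = (Bi − Bj)/(Bi + Bj) are assumptions based on the usual vegetation-index definitions; the snippet does not reproduce the 39 variables of Table 3, and the function name is hypothetical.

```python
from itertools import combinations


def band_variables(bands):
    """Build candidate variables from a dict {band number: TOA reflectance}.

    Assumes strictly positive reflectances, so no division by zero occurs.
    """
    variables = {f"B{i}": value for i, value in bands.items()}
    for i, j in combinations(sorted(bands), 2):
        bi, bj = bands[i], bands[j]
        variables[f"RVI({i},{j})"] = bi / bj                 # band ratio
        variables[f"DVI({i},{j})"] = bi - bj                 # band difference
        variables[f"NDVI({i},{j})"] = (bi - bj) / (bi + bj)  # normalized difference
    return variables


# Hypothetical reflectances for the blue, green, red, and NIR bands of one pixel.
pixel = {1: 0.10, 2: 0.12, 3: 0.08, 4: 0.04}
candidates = band_variables(pixel)
```

With four bands, this sketch yields 4 single-band variables and 3 variables for each of the 6 band pairs.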

3. Methodology

3.1 Model developments and validation

The mathematical details and structure of the RF algorithm have been discussed elsewhere (Breiman, 2001; Iverson et al., 2008); therefore, only a brief introduction to the RF algorithm is given here. The RF is a type of supervised ensemble ML technique that uses multiple decision trees and bootstrap aggregation to provide a nonparametric, multivariable, and nonlinear regression. First, “k” features are selected at random from the total features and used to calculate the root node via the best split approach. Then, a tree is constructed from the root node. This process is repeated to build multiple randomly constructed decision trees. Finally, the predictions from the individual trees are averaged to obtain the final result (Liaw and Wiener, 2002).
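The steps above can be sketched in a few lines of pure Python. This is a simplified illustration under stated assumptions, not the implementation used in the study: trees are depth-limited, k features are drawn at random at every node (the common RF practice), and the function names, depth limit, and synthetic data are all hypothetical.

```python
import random
from statistics import mean


def best_split(rows, targets, features):
    """Return the (feature, threshold) with the lowest squared error, or None."""
    best, best_sse = None, float("inf")
    for f in features:
        for t in sorted({row[f] for row in rows})[1:]:
            left = [y for row, y in zip(rows, targets) if row[f] < t]
            right = [y for row, y in zip(rows, targets) if row[f] >= t]
            ml, mr = mean(left), mean(right)
            sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
            if sse < best_sse:
                best, best_sse = (f, t), sse
    return best


def grow_tree(rows, targets, k, rng, depth=0, max_depth=4):
    """Grow one regression tree; a leaf is the mean of its targets."""
    if depth >= max_depth or len(rows) < 2:
        return mean(targets)
    split = best_split(rows, targets, rng.sample(range(len(rows[0])), k))
    if split is None:  # chosen features are constant in this node
        return mean(targets)
    f, t = split
    left = [(r, y) for r, y in zip(rows, targets) if r[f] < t]
    right = [(r, y) for r, y in zip(rows, targets) if r[f] >= t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left], k, rng, depth + 1, max_depth),
            grow_tree([r for r, _ in right], [y for _, y in right], k, rng, depth + 1, max_depth))


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] < t else right
    return node


def fit_forest(rows, targets, ntree=25, k=1, seed=0):
    """Train ntree trees, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(ntree):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        forest.append(grow_tree([rows[i] for i in idx], [targets[i] for i in idx], k, rng))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return mean(predict_tree(tree, row) for tree in forest)


# Synthetic demonstration data (not from the study).
data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(60)]
targets = [5.0 + 20.0 * r[0] for r in rows]
forest = fit_forest(rows, targets, ntree=25, k=1)
estimate = predict_forest(forest, [0.5, 0.5])
```

Because every leaf is a mean of training targets and the forest averages its trees, the estimate always lies within the range of the training targets.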

The RF model used here is developed by incorporating in-situ Chl-a and TOA reflectance to estimate the Chl-a concentration. The input variables include the in-situ Chl-a, reflectance variables of band combinations of the GF-1 WFV, and the latitude and longitude coordinates of the water quality buoy sites. The use of latitude and longitude as variables accounts for the spatiotemporal variation of the Chl-a. The performance of the models is compared by using different settings of the number of trees (ntree) and the number of variables tried at each split (mtry), and the optimal model performance is achieved when ntree is assigned the value of 600 and mtry is assigned the value of one-third of the total number of variables of each model. It is noted that the RF is a supervised ML algorithm; thus, although the in-situ Chl-a is critical for model fitting, it is not necessary for model application.

Fig. 2. Frequency distributions of the Chl-a sample dataset for (a) the whole year dataset, (b) the spring dataset, (c) the summer dataset, (d) the autumn dataset, and (e) the winter dataset. The percent value is the ratio of the count in each Chl-a bin to the total count of the samples in the dataset. The Chl-a bin width is 2 mg m−3 in each panel.

The Monte Carlo cross-validation (MCCV) technique is used to assess the potential of model fitting and model robustness (Ghorbanzadeh et al., 2020). The MCCV technique is an asymptotically consistent method for model selection. It can avoid an unnecessarily large model, decrease the risk of over-fitting in model training, and has a relatively high probability of choosing the most appropriate model. Here, the sample dataset is split randomly into two subsample datasets: one subset containing 25% of the total samples is used to validate the model, and the other subset containing the remaining 75% of the samples is used to train the model. This independent model training and test procedure is repeated multiple times, and the average of these test results is taken as an indicator with which to verify the accuracy of the model. Several statistical indicators are used to quantitatively evaluate the model performance, namely, the coefficient of determination (R2), the root-mean-square error (RMSE), and the mean absolute percentage error (MAPE) between the cross-validation predicted and observed Chl-a. The MAPE is calculated as follows:
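As a hedged sketch of the validation procedure above, the snippet below implements repeated random 75%/25% splits together with the three indicators. The MAPE is taken in its standard form, MAPE = (100%/n) Σ |ŷ_i − y_i| / y_i, and R2 is computed as 1 − SS_res/SS_tot; both are assumed to match the definitions used in the study. The number of repeats, the function names, and the straight-line stand-in model are illustrative only.

```python
import math
import random


def r2(obs, pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def rmse(obs, pred):
    """Root-mean-square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mape(obs, pred):
    """Mean absolute percentage error in percent (standard definition)."""
    return 100.0 / len(obs) * sum(abs(p - o) / o for o, p in zip(obs, pred))


def mccv(xs, ys, fit, n_repeats=100, test_fraction=0.25, seed=0):
    """Monte Carlo cross-validation: average (R2, RMSE, MAPE) over random splits.

    fit(train_x, train_y) must return a callable model; n_repeats is assumed.
    """
    rng = random.Random(seed)
    n_test = max(1, round(test_fraction * len(xs)))
    scores = []
    for _ in range(n_repeats):
        idx = list(range(len(xs)))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        obs = [ys[i] for i in test]
        pred = [model(xs[i]) for i in test]
        scores.append((r2(obs, pred), rmse(obs, pred), mape(obs, pred)))
    return tuple(sum(s[k] for s in scores) / len(scores) for k in range(3))


def fit_line(xs, ys):
    """Least-squares straight line, used here only as a stand-in for the RF model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x: my + slope * (x - mx)


# Noise-free synthetic data, so the stand-in model should be almost exact.
xs = [float(i) for i in range(1, 21)]
ys = [2.0 * x + 1.0 for x in xs]
mean_r2, mean_rmse, mean_mape = mccv(xs, ys, fit_line, n_repeats=20)
```

Any model with the same `fit` interface, including an RF, could be passed to `mccv` in place of the straight-line fit.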

3.2 Relative importance evaluation indicator (RIEI)

The selection of the most effective band plays an important role in accurate estimation of Chl-a (Goyens et al., 2013; Jiang et al., 2013). Some previous studies have determined the Chl-a spectral characteristics and sensitive bands of lake water bodies through statistical analysis or field measurement (Wu et al., 2009; Yang et al., 2011). However, for relatively turbid inland water bodies, owing to the presence of phytoplankton, suspended matter, dissolved organic matter, and many other substances that affect the absorption spectrum of Chl-a, all components mix and interact with each other such that their spectral characteristics are more complicated. The actual measured Chl-a spectral reflectance and absorption peaks between these water bodies will also have significant differences (Luo et al., 2017). Moreover, most such measurements have been undertaken in specific water bodies and therefore they lack extensive validation.

Table 3. The 39 latent variables involved in the random forest modeling

The RF algorithm can be used to quantitatively assess the relative importance of each variable to the model. Thus, irrelevant or redundant characteristic variables can be excluded from the initial large number of variables, and the small number of characteristic variables that contribute most to the model can be retained to obtain a more accurate model. The contribution of each input variable in each model is evaluated based on two factors: the percentage for increasing the mean square error (IMSE) and the INIP. As the percentage for IMSE or the INIP increases, the contribution of the variable to the Chl-a retrieval increases. Previous studies have generally used only one of the two indicators, IMSE or INIP, to select the important variables, but this practice could be inaccurate. Therefore, a novel relative importance evaluation index (RIEI), which is developed by combining IMSE and INIP to measure the relative importance of the variables, is used to filter the important variables as model inputs. The RIEI is calculated as follows:

3.3 RF model training procedure

First, a representative training dataset is needed for RF model training to estimate Chl-a concentration. The GF-1 WFV TOA reflectance and the synchronously observed Chl-a for 18 days from January 2018 to May 2019 are used as the training dataset. The 190 measured Chl-a values are grouped by season to obtain 5 sample subsets (the whole year and the four seasons) for model training. Three-quarters of the data are selected at random from each sample subset as the training sample set, and the remaining quarter of the data are taken as the validation sample set.

The selected important variables are taken as the input variables and each RF model is trained separately. The parameter mtry is set to about one-third of the number of characteristic variables, and the four values of 1, 2, 3, and 4 are tried successively. The parameter ntree takes the value of 400, 500, or 600 based on the previous error analysis results. The modeling procedure is repeated multiple times for each parameter combination (mtry, ntree), and the parameter combination with the highest accuracy is selected for each model. A detailed flowchart of the process for determining the optimal RF model is shown in Fig. 3.
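The parameter search described above amounts to a small grid search, which can be sketched as follows. The callback `evaluate(mtry, ntree, run)` is hypothetical and stands for one training-and-validation run that returns an error score; using the lowest mean RMSE as the measure of "highest accuracy" and five repeats per combination are assumptions.

```python
from itertools import product


def select_parameters(evaluate, mtry_values=(1, 2, 3, 4),
                      ntree_values=(400, 500, 600), n_repeats=5):
    """Return (mean error, mtry, ntree) for the best parameter combination.

    evaluate is a hypothetical callback returning an error (e.g., RMSE)
    for a single run; lower is treated as better.
    """
    best = None
    for mtry, ntree in product(mtry_values, ntree_values):
        mean_error = sum(evaluate(mtry, ntree, run) for run in range(n_repeats)) / n_repeats
        if best is None or mean_error < best[0]:
            best = (mean_error, mtry, ntree)
    return best


def toy_evaluate(mtry, ntree, run):
    """Toy error surface with its minimum at mtry = 2 and ntree = 600."""
    return abs(mtry - 2) + abs(ntree - 600) / 1000.0


best_error, best_mtry, best_ntree = select_parameters(toy_evaluate)
```

In practice the callback would train an RF with the given parameters on a random three-quarters of the subset and score it on the remaining quarter.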

4. Results

4.1 Determination of effective wave band combinations

First, we select the band combinations based on the single index of either IMSE or INIP for comparison purposes. The above-determined parameters mtry and ntree are used for modeling and optimization, and the IMSE and INIP values and rankings of each variable in each model are obtained separately, based on which the top-ranked variables are regarded as the important variables of model input. For comparability of the results, we select the same number of important variables in each model under the two indicators. As a result, MODYear, MODSpr, MODSum, MODAut, and MODWin have 6, 4, 4, 9, and 4 important variables, respectively (Tables 4, 5).

From the results of the above two types of filtering, it can be found that the important variables of each model are significantly different. The degree of coincidence of the important variables of the five models is only 67%, 0, 50%, 22%, and 75%, respectively, and the ranking order of the variables is markedly different. Only in model MODWin is the first-ranked variable the same under both indicators, whereas the orders of the variables in the remaining four models are all different. This analysis suggests that it might be inappropriate to use only one of the indicators to measure the importance of the variables and to use this as a criterion for selecting important variables.
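The degree of coincidence can be read as the share of variables selected by one indicator that are also selected by the other. The sketch below computes it for two hypothetical variable lists; the variable names are placeholders, and the counts of shared variables (4, 0, 2, 2, and 3) are inferred here only because they are consistent with the reported percentages and the model sizes of 6, 4, 4, 9, and 4.

```python
def coincidence(selected_a, selected_b):
    """Percentage of variables in selected_a that also appear in selected_b."""
    shared = set(selected_a) & set(selected_b)
    return 100.0 * len(shared) / len(selected_a)


# Hypothetical variable names for a model with six important variables.
by_imse = ["v1", "v2", "v3", "v4", "v5", "v6"]
by_inip = ["v1", "v2", "v3", "v4", "v7", "v8"]
overlap = coincidence(by_imse, by_inip)  # 4 of 6 variables are shared

# Inferred shared counts and the number of variables per model.
shared_counts = (4, 0, 2, 2, 3)
model_sizes = (6, 4, 4, 9, 4)
percentages = [round(100 * s / n) for s, n in zip(shared_counts, model_sizes)]
```

With these inferred counts, the rounded percentages match the five values quoted in the text.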

Fig. 3. Detailed flowchart of the process for determining the optimal RF model.

Table 4. Model filtering results based on the IMSE indicator

Table 5. Model filtering results based on the INIP indicator

In the following, we determine the effective band combinations based on the RIEI. As above, the results of the evaluation of the importance of the variables in MODYear, MODSpr, MODSum, MODAut, and MODWin are obtained according to the RIEI. As an example, the ranking of the importance of the variables in MODSum and MODWin is shown in Figs. 4a and 4b, respectively.

According to the results of the importance evaluation, the important characteristic variables of the five models are filtered out. As a result, MODYear, MODSpr, MODSum, MODAut, and MODWin have 6, 8, 12, 5, and 5 important variables, respectively (Table 6).

Fig. 4. Importance of each input variable in the RF training models of (a) MODSum and (b) MODWin derived from the RIEI. For each model, only the first 15 RIEI values of the 29 latent variables are shown as examples. The larger the RIEI value, the larger the contribution of the variable to the Chl-a concentration retrieval.

Table 6. Model filtering results based on the RIEI indicator

From the band combinations in the five models, we can obtain some interesting findings. The first is that each of the models contains all four bands of GF-1 WFV, indicating that each band of GF-1 WFV contains Chl-a information. In fact, the spectral ranges of the 4 multispectral bands of GF-1 WFV are very similar to those of the 4 bands of Landsat TM/ETM+ [i.e., blue (0.45–0.52 μm), green (0.52–0.60 μm), red (0.63–0.69 μm), and NIR (0.76–0.90 μm)]. They are the most suitable Landsat TM/ETM+ bands for characterizing Chl-a in complex coastal waters (Qi et al., 2014). Recent studies on the inversion of Chl-a also show that various combinations of the 4 GF-1 WFV bands are correlated with Chl-a in various seasons or regions (Xie et al., 2019). Secondly, among the variables of the five models, there is only one single-band variable, and the rest are multi-band combination variables, suggesting that the multi-band combination methods perform better than the single-band methods. A similar result has been confirmed by extensive previous studies (Cheng et al., 2013). In addition, we find that each model contains many band ratio variables, such as ratio vegetation indices [RVI(1,2), RVI(1,3), RVI(2,3), and RVI(2,4)], and some difference vegetation indices, such as DVI(1,2) and DVI(2,3). Chl-a is the primary photosynthetic pigment in terrestrial green plants and phytoplankton in water; it absorbs strongly in the blue (B1) and red (B3) spectral regions and reflects strongly in the green (B2) and NIR (B4) spectral regions, indicating a similarity between the spectral reflectance of algae-containing water and terrestrial vegetation (Cheng et al., 2013). Therefore, it is reasonable to include some vegetation indices or similar band ratio variables in the model. Various types of vegetation indices and similar band ratios have been used for Chl-a retrieval in previous studies (Cheng et al., 2013; Xie et al., 2019).
However, the model band combinations used by different algorithms may vary owing to the great spatial and temporal changes in the biophysical characteristics of turbid waters. Moreover, because of the variety of sampling times and positions, even the models of the same water body may vary among authors, and the band combinations from different authors may also differ, even if the same model-building method is used (Cheng et al., 2013). Therefore, it is reasonable to believe that the RF models with the optimal band combinations of GF-1 WFV are feasible for estimating Chl-a in Lake Taihu.

Although these findings are encouraging, the RF-derived algorithm is a black box model, and its main drawback is that it is not easy to give a clear interpretation of the decision procedure. However, this drawback can be compensated to some extent by initially selecting variables that are physically meaningful and then selecting the optimal band combinations according to the variable importance.

4.2 Performance of RF training models

The R2 values of MODYear, MODSpr, MODSum, MODAut, and MODWin with the highest accuracy are listed in Table 7. Additionally, model training is also performed by using the variables filtered by IMSE and INIP separately, and the results are presented in Table 7 for comparison. Among the five RIEI trained models, the R2 values of MODYear, MODSpr, MODSum, and MODAut are all more than 0.9, and R2 of MODWin is 0.89. Among the IMSE trained models, only R2 of MODYear is more than 0.9, and only two of the INIP trained models (MODSpr and MODAut) have R2 more than 0.9. The maximum R2 of the five RIEI trained models is 0.98, which is slightly smaller than the maximum value (0.99) of any INIP trained model, but markedly greater than the maximum value (0.91) of any IMSE trained model. The minimum R2 of the five RIEI trained models is 0.89, which is markedly greater than that of any model trained with the other two indicators. Moreover, among the five RIEI trained models, only R2 of MODAut is slightly lower than that of the corresponding INIP trained model, and the R2 values of the other models are higher than or equal to those of the models trained with the other two indicators. The results indicate that the fitting of the models trained by the RIEI is better than that achieved for the models trained by a single indicator.

To illustrate intuitively the performance of the models trained with the different indicators, scatter plots of the fitting results of the RF models are shown in Fig. 5. It can be seen that the performance of the RIEI trained models is relatively stable, with RMSEs of 1.24–2.67 mg m−3. The RMSEs of the models are all less than 2 mg m−3, except for that of MODSum, and the MAPEs fall in the range of 16.83%–27.26%. However, the maximum RMSE of the IMSE trained models is 5.98 mg m−3, and the RMSE of the INIP trained MODAut is 5.14 mg m−3. The RIEI trained models thus perform better than the models trained with a single indicator.

Table 7. Coefficient of determination (R2) of the five RF models trained by using the RIEI and single IMSE and INIP indicators

Fig. 5. Performances of the five RF models (MODYear, MODSpr, MODSum, MODAut, and MODWin) trained with the RIEI, and the single IMSE and INIP indicators, using a test dataset.

4.3 Validation of the RF models

The MCCV is used to validate the performance of the RF models. Comparison of the Chl-a derived from in-situ measurements and that retrieved by the RF models (MODYear, MODSpr, MODSum, MODAut, and MODWin) is shown in Fig. 6. Despite the limited number of matchup data, the Chl-a derived by the RF models generally agrees well with the in-situ measurements. There is high correlation between the Chl-a retrieved by each model and the in-situ measured values, and the RMSE is small. The MCCV R2 of the five models is in the range of 0.71–0.94 and the RMSEs are in the range of 1.40–4.30 mg m−3. Among them, MODAut has the highest accuracy (R2 = 0.94, RMSE = 1.66 mg m−3), followed by MODSpr and MODSum (R2 values of 0.88 and 0.88 and RMSE values of 2.94 and 2.59 mg m−3, respectively). The accuracy of MODYear is the lowest (R2 = 0.71) and its RMSE (4.30 mg m−3) is markedly higher than that of the four seasonal models. The MAPEs of the four seasonal models vary between 18.47% and 25.59%, substantially lower than the value of 38.32% for MODYear. This result shows that the RF models using these optimized GF-1 WFV band combinations can accurately estimate the Chl-a in Lake Taihu. Among them, the performance of the model with all samples is clearly inferior to that of the seasonal models, and the performance of MODAut is better than that of the other three seasonal models.

Examples of the performance of the models for non-bloom and bloom cases are shown in Fig. 7. The two GF-1 WFV RGB images composed of the three channels B1, B2, and B3 show two situations: no blooms in winter (Fig. 7a) and blooms in autumn (Fig. 7b). The retrieved Chl-a distributions for the two cases are shown in Figs. 7c and 7d, respectively. For the non-bloom case on 13 February 2018, the spatial distribution of the Chl-a throughout the lake is reasonably uniform, with the maximum and minimum values being approximately 8 and 3 mg m−3, respectively. In the case of algal bloom on 27 October 2018, the spatial distribution of the Chl-a is highly consistent with that of the cyanobacterial bloom, and spatial heterogeneity is substantial, with the maximum and minimum values being approximately 30 and 3 mg m−3, respectively. This demonstrates that our models can estimate the Chl-a over a large range of water and reveal its distribution in detail.

Fig. 6. Scatter plots of the cross-validation results of our five models: (a) MODYear, (b) MODSpr, (c) MODSum, (d) MODAut, and (e) MODWin.

Fig. 7. Performances of the Chl-a concentration retrieval RF models in cases without algal bloom (13 February 2018) and with algal bloom (27 October 2018): (a) and (b) GF-1 WFV RGB images without algal bloom (in winter) and with cyanobacterial algal bloom (in autumn), respectively, (c) and (d) Chl-a concentration for the same two cases derived by using the RF models, respectively.

To further evaluate the performance of our RF models, the GF-1 WFV TOA reflectances on 4 June and 13 December 2019 are selected as input data, and the in-situ measured Chl-a at the corresponding times is taken as the independent test dataset. These datasets are not used for model training or cross-validation purposes. The two datasets also represent two different situations: blooms in summer (4 June 2019) and no blooms in winter (13 December 2019), and the Chl-a of these 2 days is retrieved with MODSum and MODWin, respectively. Figure 8 shows the GF-1 WFV RGB images, the distribution of retrieved Chl-a, and scatter plots of the in-situ measured and model-retrieved Chl-a. It can be seen that in the case without algal blooms the distribution of the Chl-a retrieved by MODWin is more uniform, whereas in the case with algal blooms the distribution of the Chl-a derived by MODSum is consistent with that of the cyanobacterial blooms. For MODWin and MODSum, tested with independent samples, the R2 values are 0.49 and 0.61, the RMSEs are 2.34 and 5.31 mg m−3, and the MAPEs are 32.74% and 31.83%, respectively. Overall, the models estimate the Chl-a reasonably well, although the R2 values are lower and the RMSEs and MAPEs are larger than in cross-validation.

4.4 Temporal variation and trend

To investigate the annual variation of Chl-a in Lake Taihu, the seasonal models are used to retrieve the 15-day Chl-a concentration in 2018, and the average value in the entire lake on each day is obtained, as shown in Fig. 9. Owing to the lack of matchedGF-1WFV images,the estimated Chl-a for certain months (March, August,September, and November) is missing, but it can still be seen from Fig. 9 that the Chl-a in the lake exhibits an obvious seasonal trend. Although the lake shows relatively high Chl-a during all seasons, the Chl-a is substantially higher during summer (June–August) and autumn (September–November) in comparison with spring (March–May) and winter (December–February). The average Chl-a concentration in spring, summer, autumn, and winter is 7.7, 9.6, 8.6, and 7.1 mg m−3, respectively. This temporal trend is largely consistent with previous research (Shi et al., 2017). As a comparison, Fig. 10 shows the monthly average area of the cyanobacterial bloom derived from the NDVI model using all available MODIS images collected during 2007–2018. The higher temporal resolution of MODIS allows for more frequent monitoring of the cyanobacterial blooms, and the time series of the area of cyanobacterial blooms can characterize the long-term trends in the cyanobacterial dynamics of Lake Taihu. Generally, as the Chl-a in the lake increases, the possibility of a cyanobacterial bloom increases. For cyanobacterial bloom areas, a clear seasonal cycle is observed in Lake Taihu, whereby cyanobacterial blooms occurred much more often during summer and autumn and less frequently in spring and winter. The extent of cyanobacterial bloom areas are also considerably higher in summer and autumn than in spring and winter. Under normal circumstances, after a long-term bloom-free period in winter, surviving cyanobacteria float to the water surface during spring, which means the Chl-a increases gradually and the intensity of the cyanobacterial bloom increases substantially. 
However, we find that the areas of cyanobacterial blooms in June and July, the months affected primarily by the Jianghuai Meiyu (a prolonged period of rainy weather that occurs almost every year in this area), are smaller than in May (Fig. 10). Even under the influence of the Jianghuai Meiyu, summer is still considered the season with the greatest intensity and areal extent of cyanobacterial blooms. The temporal trends of the retrieved Chl-a and of the observed cyanobacterial blooms in Lake Taihu corroborate each other.
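The aggregation described in this subsection (a lake-wide mean per retrieval day, then seasonal means and a short moving average) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' processing chain: the function names, the use of None for masked pixels, the trailing form of the moving average, and all sample values are hypothetical. Only the season definitions are taken from the text.

```python
from collections import defaultdict
from datetime import date

# Season definitions follow the text: spring Mar-May, summer Jun-Aug,
# autumn Sep-Nov, winter Dec-Feb.
SEASON_BY_MONTH = {
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
    12: "winter", 1: "winter", 2: "winter",
}


def lake_mean(pixels):
    """Mean Chl-a over valid water pixels; None marks a masked pixel."""
    valid = [v for v in pixels if v is not None]
    return sum(valid) / len(valid)


def seasonal_means(daily):
    """Average the lake-wide daily means within each season."""
    groups = defaultdict(list)
    for day, value in daily:
        groups[SEASON_BY_MONTH[day.month]].append(value)
    return {season: sum(vals) / len(vals) for season, vals in groups.items()}


def moving_average(values, window=2):
    """Trailing moving average over consecutive retrievals (assumed form)."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]


# Hypothetical retrievals (illustrative values, not the paper's data).
scenes = [
    (date(2018, 1, 10), [6.0, 8.0, None]),
    (date(2018, 4, 20), [7.0, 9.0, 8.0]),
    (date(2018, 7, 5), [10.0, None, 12.0]),
    (date(2018, 10, 15), [9.0, 9.0, 9.0]),
]
daily = [(day, lake_mean(pixels)) for day, pixels in scenes]
by_season = seasonal_means(daily)
smoothed = moving_average([value for _, value in daily])
```

Because retrieval days are unevenly spaced when matched images are missing, a moving average over consecutive retrievals smooths over gaps of different lengths; a time-weighted average would be an alternative.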

Fig. 8. Validations of the RF models using independent sample datasets in the case of no algal blooms (13 December 2019) and algal blooms (4 June 2019): (a) and (b) GF-1 WFV RGB images, (c) and (d) distribution of Chl-a concentration retrieved by RF models (MODWin and MODSum), and (e) and (f) scatter plots of Chl-a concentration derived from in-situ measurements versus those retrieved by using the models, respectively.

Fig. 9. Temporal variation of Chl-a concentration in Lake Taihu in 2018 estimated by using the RF models. The red line is the 2-day moving average of the Chl-a concentration.

Fig. 10. Temporal variation in the monthly average area of cyanobacterial blooms in Lake Taihu derived from EOS/MODIS acquired during 2007–2018. The red line is the 2-month moving average of the cyanobacterial bloom area.

4.5 Spatial distribution

To analyze the annual spatial distribution of Chl-a in Lake Taihu, the Chl-a concentrations retrieved for 15 days in 2018 are averaged pixel by pixel to obtain a composite map of the spatial distribution of Chl-a in Lake Taihu in 2018 (Fig. 11a). Additionally, Fig. 11b shows the number of cyanobacterial blooms in each pixel obtained from MODIS. Although the number of matched GF-1 WFV images (15) is far below that of the MODIS images with cyanobacterial blooms (134), the spatial distribution of Chl-a retrieved by our models is generally highly consistent with that of the number of cyanobacterial blooms. The northwestern part of the lake is the area with the greatest annual average Chl-a concentration. Correspondingly, it is also the area in which cyanobacterial blooms occur most frequently and where the water quality is the poorest. The eastern coastal area, with the best water quality, has the lowest Chl-a and fewer occurrences of cyanobacterial blooms. It should be noted that, owing to the abundance of aquatic plants, the Chl-a in the southeast corner is also relatively high and remains so throughout the year (except in winter; Zhang Z. et al., 2018). These results are generally consistent with those of previous research (Qi et al., 2014).
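The pixel-by-pixel compositing and the per-pixel bloom count described above can be illustrated with a small sketch. It assumes co-registered grids of equal size, uses None for pixels without a valid retrieval, and takes a boolean mask per scene as the bloom flag. These conventions, the function names, and the sample values are hypothetical and are not taken from the paper.

```python
def composite_mean(maps):
    """Pixel-by-pixel mean over 2-D maps, skipping None (no valid retrieval)."""
    rows, cols = len(maps[0]), len(maps[0][0])
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            valid = [m[r][c] for m in maps if m[r][c] is not None]
            # A pixel with no valid retrieval in any scene stays None.
            row.append(sum(valid) / len(valid) if valid else None)
        out.append(row)
    return out


def bloom_count(masks):
    """Number of scenes in which each pixel is flagged as bloom (True)."""
    rows, cols = len(masks[0]), len(masks[0][0])
    return [[sum(1 for m in masks if m[r][c]) for c in range(cols)]
            for r in range(rows)]


# Hypothetical 2x2 grids for three scenes (illustrative values only).
chl_maps = [
    [[12.0, 8.0], [6.0, None]],
    [[14.0, None], [8.0, None]],
    [[16.0, 10.0], [7.0, None]],
]
bloom_masks = [
    [[True, False], [False, False]],
    [[True, True], [False, False]],
    [[False, False], [False, False]],
]
composite = composite_mean(chl_maps)
counts = bloom_count(bloom_masks)
```

Averaging only over valid retrievals keeps cloud-masked pixels from biasing the composite, at the cost of different pixels being averaged over different numbers of scenes.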

Fig. 11. Spatial distributions of (a) Chl-a concentration retrieved from RF models and (b) the cumulative number of occurrences of cyanobacterial blooms derived from EOS/MODIS in Lake Taihu in 2018. Note that (b) counts only cyanobacterial blooms; the southeast of Lake Taihu usually has only aquatic plants and no cyanobacterial blooms.

Figure 12 shows the seasonal spatial distribution of the retrieved Chl-a in Lake Taihu in 2018. The Chl-a exhibits obvious seasonal variability, being substantially higher during summer and autumn than in spring and winter. Except for winter, there are obvious spatial changes in Chl-a in spring, summer, and autumn. In spring, the northwestern part is the area with the highest Chl-a, followed by the southern coastal and some central areas of the lake. In summer, the Chl-a in most areas of the northwest, west, and south of the lake is obviously higher, and only a few areas along the east coast have relatively low Chl-a. Compared with summer, the area with high Chl-a shrinks substantially in autumn and is located mainly in the western coastal and northern parts of the lake. In winter, the Chl-a in the entire lake is markedly reduced, and the spatial distribution is more uniform. This result is consistent with the actual situation. Notably, the areas of high Chl-a in the three seasons other than winter are located mainly in the northwestern part of the lake, which is attributable mainly to the large number of rivers entering the lake and the high density of urban sewage outfalls in these areas. The high nutrient concentration leads to serious eutrophication of the water body, which favors the growth of algae (Dai et al., 2016; Zhu et al., 2018). Additionally, the spatial distribution pattern of Chl-a is also affected by meteorological conditions. The low and uniform distributions of Chl-a in winter are mainly due to the low temperature, which causes algae in the water to almost stop growing (Xiong et al., 2012). The prevailing southeasterly wind in summer and autumn causes algae to become concentrated in the northwest region of the lake (Qin et al., 2004; Li et al., 2016). This confirms that the Chl-a spatial distribution patterns derived from the RF models are reasonable.

Fig. 12. Spatial distributions of Chl-a concentration in Lake Taihu, as estimated by the RF models: (a) spring, (b) summer, (c) autumn, and (d) winter.

5. Discussion

5.1 Comparison with recent studies

Table 8 summarizes the model performances of different algorithms reported in previous studies on estimation of Chl-a in Lake Taihu using satellite data. The spatial resolutions of the different models range from 16 m to 1 km, with most of the studies having spatial resolutions coarser than 16 m. The CV R2 of the different models varies from 0.47 to 0.94, with most R2 values less than 0.88. The seasonal RF models capture on average about 86% of the variability in Chl-a concentration in the sample-based CV, which is higher than the sample-based CV R2 of almost all other models. The RMSE values of the seasonal models are generally lower than those of the other models. Our RF-derived models also perform comparably to or better than some previous models based on different ML algorithms (Zhang et al., 2009; Xu et al., 2019). Overall, the seasonal models have a robust and superior performance in estimating Chl-a with an extremely high spatial resolution of 16 m.
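For readers comparing the scores in Table 8, the two metrics can be computed as in the sketch below. It assumes that R2 denotes the coefficient of determination (1 minus the ratio of residual to total sum of squares). Some studies instead report the squared Pearson correlation, and the text here does not state which definition each cited study uses. The sample values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical in-situ and retrieved Chl-a values in mg m-3 (not the paper's data).
observed = [5.0, 7.0, 9.0, 11.0]
predicted = [6.0, 7.0, 8.0, 11.0]
score_rmse = rmse(observed, predicted)
score_r2 = r_squared(observed, predicted)
```

In a sample-based cross-validation these functions would be applied to the held-out samples of each fold, with the predictions pooled or the scores averaged across folds.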

5.2 Advantages of the Chl-a retrieval models

The greatest advantage of the models developed in this study is their high spatial resolution of 16 m. The GF-1 satellites can resolve the details of small patches of inland lakes better than MODIS with its 250-m spatial resolution. The retrieval results show that the GF-1 WFV images describe more details of the spatial distribution pattern of Chl-a in Lake Taihu than the MODIS images (Qi et al., 2014; Lary et al., 2016). Several previous studies have also shown the good performance of GF-1 WFV in retrieving Chl-a over inland lakes (Zhu et al., 2017; Xu et al., 2020). Another difference from previous studies is that this study captures the Chl-a information directly from GF-1 WFV TOA reflectance without an AC algorithm. This avoids the need to retrieve the water-leaving radiance through AC, which is typically prone to large errors in turbid and high-biomass waters. The seasonal and spatial distributions of Chl-a in Lake Taihu estimated by our models are consistent with previous studies (Qi et al., 2014; Lary et al., 2016). In addition, the in-situ Chl-a used in this study is measured with a YSI6600 instrument and the fluorescence analysis method (Liu et al., 2010), which has the advantages of automatic acquisition and of yielding results within 30 minutes. In contrast, the commonly used spectrophotometric analysis method is much less efficient: it cannot obtain results automatically and may take nearly a day. Overall, the seasonal models, with their spatial resolution of 16 m, can retrieve the Chl-a with higher accuracy and reflect its distribution pattern and trend across the entire Lake Taihu.

Table 8. Model performances reported in some previous studies on estimation of Chl-a concentration in Lake Taihu using satellite data

5.3 Limitations and future work

Here, we develop seasonal models to estimate Chl-a in inland lakes with a high spatial resolution of 16 m by using direct measurements of GF-1 WFV TOA reflectance. Compared with the MODIS-based models, the proposed models have a much higher spatial resolution and can provide more details about variations in Chl-a in Lake Taihu. However, there are some limitations and room for model improvement. Like other ML-based methods, the RF-derived algorithm is a black-box model whose decision procedure is not easy to interpret. Therefore, it is necessary to select more variables with physical significance for model training, such as indices based on the red and near-infrared channels. Furthermore, the slight difference in the center wavelength of the four GF-1 WFV cameras may affect the NDVI value. This difference may have been accounted for implicitly when training the models with the artificial intelligence method, and the model outputs appear to be highly accurate. Nevertheless, we will further study the physical mechanisms of these subtle effects by using radiative transfer models to obtain more accurate models in the future. The in-situ Chl-a used in the model development is measured by the fluorescence analysis method, which gives relatively smaller values than the spectrophotometric method (Liu et al., 2010); this may cause the Chl-a retrieved by our models to be slightly low. Clearly, future efforts should be dedicated to comparison with widely adopted methods [such as high performance liquid chromatography (HPLC) and the spectrophotometric method] and to evaluating the potential impact of the measurement method on the algorithm development, to further improve the accuracy of the models. In addition, the limited number of samples and the model run results also suggest the need for further improvement of our models. We anticipate collecting more in-situ measured Chl-a and GF-1 WFV images and carrying out experiments for other lakes in the near future.
On a broader scale, the approaches and findings of this study may be extended to other eutrophic inland lakes such as Lake Chaohu, Lake Dianchi, and Lake Erie. Our results will help managers and decision-makers account for and modify their strategies for controlling water eutrophication and cyanobacterial blooms in response to future climate change and human impacts.
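To make the camera dependence of the NDVI noted in this subsection concrete, the sketch below evaluates the standard definition NDVI = (NIR − Red) / (NIR + Red) for two sets of band reflectances. The reflectance values and the camera labels are hypothetical. They only illustrate how a small difference in the measured near-infrared reflectance, such as one arising from slightly different band responses, shifts the index.

```python
def ndvi(red, nir):
    """Normalized difference vegetation index from red and NIR reflectance."""
    total = nir + red
    if total == 0:
        raise ValueError("NDVI is undefined when red + nir is zero")
    return (nir - red) / total


# Hypothetical TOA reflectances of the same water pixel as seen by two cameras
# (illustrative values, not measurements from GF-1 WFV).
camera_a = {"red": 0.05, "nir": 0.15}
camera_b = {"red": 0.05, "nir": 0.16}

ndvi_a = ndvi(**camera_a)
ndvi_b = ndvi(**camera_b)
shift = ndvi_b - ndvi_a  # change in NDVI caused by the NIR difference alone
```

With these values, a difference of 0.01 in near-infrared reflectance changes the NDVI by roughly 0.02, which indicates why band differences between cameras could matter when NDVI-type indices are used as model inputs.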

6. Conclusions

Four seasonal Chl-a estimation models for Lake Taihu with a high spatial resolution of 16 m are developed based on the RF algorithm by directly using GF-1 WFV measurements of TOA reflectance and relevant in-situ measured water quality data. A novel comprehensive variable importance evaluation index (RIEI) is used instead of a single index (IMSE or INIP) to determine the optimal GF-1 bands and band combinations for improving retrieval accuracy. Compared with the MODIS images with a coarse spatial resolution of 250 m, the proposed models estimate Chl-a at a fine spatial resolution (16 m) very well, especially in the seasons with high Chl-a. The four seasonal models perform well in the sample-based cross-validation. MODAut has the highest performance, with R2 of 0.94 and RMSE of 1.66 mg m−3, followed by MODSpr and MODSum (R2 values of 0.88 and 0.88 and RMSE values of 2.94 and 2.59 mg m−3, respectively). The results of independent sample tests show that these models accurately capture the fine features of the Chl-a distribution pattern and amplitude over Lake Taihu. The proposed models also reveal the temporal and spatial variation characteristics of Chl-a in Lake Taihu, which are related mainly to the distribution of water pollution and the annual climatic cycle.

The results of this study also illustrate the unique advantages of GF-1 WFV data in relation to the observation of Chl-a, and GF-1 WFV data are becoming increasingly important for monitoring China’s inland water bodies with high spatial resolution. Our study provides a new perspective regarding the acquisition of high-quality, high-resolution Chl-a using remote sensing technology. The application of the high-resolution Chl-a concentration provided in this study could further improve the prediction accuracy of cyanobacterial blooms. Furthermore, this study provides the government with information on the spatial distribution and temporal variations of Chl-a in Lake Taihu, which is of great importance regarding the ecological management of the water and ecosystems of Lake Taihu.

Acknowledgments. The authors would like to thank the Jiangsu Environmental Monitoring Center for providing the in-situ measured data of Chl-a concentration in Lake Taihu. The authors thank Mrs. Rongrong Hang for providing IT support and James Buxton MSc for editing the English text of this manuscript.