Shiqi Tian·Shijie Wang·Xiaoyong Bai·Dequan Zhou·Qian Lu·Mingming Wang·Jinfeng Wang
Abstract To obtain soil Pb content quickly and efficiently, multivariate linear regression (MLR) and principal component regression (PCR) models for estimating Pb content were established on the basis of hyperspectral techniques, and their applicability to different soil types was evaluated. Results indicated that Pb exhibited strong spatial heterogeneity in the study area, and more than 82% of the samples exceeded the background value. In addition, the pollution range was large. Pb was sensitive in the near-infrared band, and among all the transformed forms, absorbance (AB) was the most strongly correlated with Pb content. Both models achieved optimal stability and reliability when AB was used as the independent variable. Compared with the PCR model, the MLR model showed superior stability, fitting accuracy, and predictive power, with a coefficient of determination of 0.724, a root mean square error of 24.92, and a mean relative error of 28.22%. Both models could be applied to different soil types; however, MLR had better applicability than PCR. The PCR model that distinguished different soil types was more reliable than the one that did not. Thus, models established via hyperspectral techniques can achieve large-area, rapid, and efficient monitoring of soil Pb content, which can provide technical support for the treatment of heavy metal pollution in soil.
Keywords Hyperspectral data·Heavy metal·Pb·Estimation
Large amounts of domestic and industrial waste have been discharged into soils, and chemical fertilizers are also used in agriculture. As a result, soils accumulate the heavy metal Pb, which enters the human body through the food chain and other pathways and poses a serious threat to human health (Li et al. 2013a; Song et al. 2017). Estimating soil Pb content is therefore crucial.
Traditionally, the determination of heavy metal content in soil requires extensive sampling and lengthy laboratory analysis. This approach has high measurement accuracy but is costly in time and money and has low efficiency; it makes monitoring the continuous spatial distribution of elements over large areas difficult, and it may damage the soil environment (Ren et al. 2009; Song et al. 2015; Wang et al. 2018). Hyperspectral technology can estimate heavy metal content in soil effectively and nondestructively while saving substantial time and money and considerably improving estimation efficiency (Gholizadeh et al. 2015; Guo et al. 2015; Tao et al. 2016). Some scholars have achieved the indirect inversion of heavy metal content by using the intrinsic relationship between heavy metals and soil components, such as organic matter, iron–manganese oxides, and clay minerals (Sun et al. 2018; Lu et al. 2019). Although good results have been obtained, the degree of heavy metal adsorption by these soil components is often affected by differences in heavy metal elements and soil conditions. Therefore, the applicability of such models to other elements or other soil conditions has to be examined further. Existing heavy metal prediction models are divided into linear and nonlinear models. Some scholars have predicted Pb content with nonlinear models, such as neural networks (Tian et al. 2019), random forests (Tan et al. 2020), and support vector machines (Tan et al. 2014). Although the results are satisfactory, constructing these models often requires programming, which is difficult, and the models are prone to overfitting, which can lead to inaccurate results. Other scholars have estimated Pb content using linear models, such as multiple linear regression (MLR), multiple linear stepwise regression, and partial least squares regression (Tan et al. 2014; Zhang et al. 2014; Yao et al. 2014; Maliki et al. 2014), which have also achieved the desired results. Moreover, linear models are relatively easy to develop, simple to calculate, and easy to operate.
Current research on soil Pb content focuses on nonkarst ecosystems. In karst areas, the soil distribution is discontinuous and spatial heterogeneity is strong (Zhang et al. 2010), which makes studying soil Pb content difficult; research on soil Pb content in karst areas is therefore relatively scarce. In addition, several problems arise in the application of hyperspectral technology in a karst region. On the one hand, the hyperspectral data commonly used today come from three sources: remote sensing satellite imagery, field spectrometers, and laboratory spectrometers. The terrain of a karst area is complex and diverse, and bare rock is exposed. The low signal-to-noise ratio of the sensors and the complex error sources may degrade the quality of field spectra and remote sensing images, making effective soil spectral information difficult to obtain and leaving current research strongly dependent on laboratory spectra. Moreover, spectral data based on laboratory analysis are prone to errors due to external factors and defects in the instrument itself. Therefore, preprocessing the soil spectrum and optimizing the hyperspectral signal are necessary to improve prediction accuracy. Whether a model established from laboratory spectral data can be applied to field spectra and remote sensing images remains unclear because of the regional uniqueness of soil and the uncertainty of the field environment. On the other hand, current research is either conducted on a single soil type or does not consider the soil type, and the applicability of models to different soil types has yet to be studied.
In view of the above limitations, this study took the Houzhai River Watershed in Guizhou Province as a case study. On the basis of the measured Pb content and hyperspectral data, MLR and principal component regression (PCR) estimation models were built. Moreover, the applicability of these two models to different soil types was evaluated. The aim was to explore the possibility of using hyperspectral techniques to estimate Pb content in karst areas and to provide new ideas for the rapid monitoring of heavy metal elements in the soil of other ecosystems.
Located in the west of the Qianzhong Plateau, the Houzhai River Watershed has a well-developed karst landform and lies on the divide between the Yangtze and Pearl River systems (Li et al. 2013a, b). Its longitude and latitude ranges are 105°41′E–105°48′E and 26°13′N–26°17′N, respectively. The elevation is 1220–1560 m, and the watershed area is approximately 75 km². The terrain in the watershed is high in the southeast and low in the northwest. Land use includes forest land, cropland, grassland, construction land, and water bodies. The upper reaches are mainly forest land, whereas the middle and lower reaches are mainly cropland. The watershed has many soil types, including limestone, paddy, and yellow soils, and their spatial distribution pattern is complex (Zhang et al. 2018).
A total of 98 soil samples of different soil types, each weighing approximately 1 kg, were collected in the watershed. The sampling depth was 0–20 cm, and the sampling sites were widely distributed (Fig. 1). Each sample was divided into two parts after natural air-drying and removal of impurities. The samples were ground and passed through a 200-mesh nylon sieve. One part was used for chemical analysis, and the other was used for spectral analysis.
The soil samples were subjected to microwave digestion with hydrochloric–nitric–perchloric acid, and Pb content was determined via inductively coupled plasma mass spectrometry (PerkinElmer, Canada). Quality control was performed using standard samples (AGV-2 = 12.191 mg/kg; AMH-1 = 9.822 mg/kg; GBPG = 13.054 mg/kg) to ensure the quality of the analysis. Spectral determination of soil samples was performed in the laboratory using a Cary 5000 UV–Vis–NIR spectrophotometer (Agilent Technologies, USA). The spectral range was 500–2500 nm, the sampling interval was 1 nm, and three spectra were collected for each soil sample. Unscrambler software was used to perform principal component analysis on the 98 samples, which identified two outliers that were then eliminated; the average spectral reflectance of the remaining 96 samples was taken as the original reflectance spectral value (Xu et al. 2011; Li et al. 2017).
Fig. 1 Overview of the study area and distribution of samples. a Distribution of samples, b elevation, c soil types, and d land use types
The spectral measurements were susceptible to errors due to random factors. The number of spectral variables was large relative to the number of samples, and the overlap of information between adjacent bands caused redundancy in the spectral data and some interference with the data analysis. The original spectral data were therefore preprocessed via Savitzky–Golay smoothing, resampling (RE), continuum removal (CR), reflectance first derivative (RFD), reflectance second derivative (RSD), absorbance (AB), first derivative of absorbance (AFD), and second derivative of absorbance (ASD), thereby effectively reducing the influence of noise, purifying the spectral information, and reducing errors (Fig. 2) (Shi 2014; Peng et al. 2014; Ma et al. 2016; Wang et al. 2018; Tu et al. 2018).
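Several of the transformations listed above can be sketched in a few lines. The example below is a minimal illustration, not the authors' Unscrambler workflow: the 5-point quadratic Savitzky–Golay window, the block-averaging form of resampling, and the definition AB = log10(1/R) are all assumptions, since the paper does not report these settings, and the function names are hypothetical.

```python
import math

# 5-point quadratic Savitzky-Golay weights (window size and polynomial
# order are assumptions; the paper does not state them).
SG5 = (-3.0, 12.0, 17.0, 12.0, -3.0)


def savgol5(values):
    """Smooth a spectrum; the two points at each edge are left unchanged."""
    out = list(values)
    for i in range(2, len(values) - 2):
        window = values[i - 2:i + 3]
        out[i] = sum(w * v for w, v in zip(SG5, window)) / 35.0
    return out


def resample(values, step):
    """Average consecutive blocks of `step` points (e.g. 1 nm -> 10 nm)."""
    return [sum(values[i:i + step]) / step
            for i in range(0, len(values) - step + 1, step)]


def absorbance(reflectance):
    """Apparent absorbance, assumed here to be log10(1 / R) with 0 < R <= 1."""
    return [math.log10(1.0 / r) for r in reflectance]


def first_derivative(values, spacing):
    """Central-difference derivative; returns len(values) - 2 points."""
    return [(values[i + 1] - values[i - 1]) / (2.0 * spacing)
            for i in range(1, len(values) - 1)]


# Illustrative reflectance values only, not measured data.
reflectance = [0.20, 0.22, 0.21, 0.25, 0.27, 0.26, 0.30, 0.31]
ab = absorbance(savgol5(reflectance))
afd = first_derivative(ab, spacing=1.0)   # first derivative of absorbance
print(len(ab), len(afd))
```

Applying `first_derivative` twice gives a second-derivative spectrum (RSD or ASD), at the cost of two more edge points.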
Fig. 2 Spectral data smoothing and resampling. a Original spectral curve, b spectral smoothing, and c spectral resampling
The soil samples were sorted by Pb content in descending order. One out of every three samples was used as a validation sample (32 in total, accounting for 33.3% of the samples), and the remaining samples were used for calibration (64 in total, accounting for 66.7% of the samples). The correlation between Pb content and the spectral data after RE, RFD, RSD, CR, AB, AFD, and ASD was analyzed, and the feature bands were screened. With the feature bands and soil Pb content taken as the independent and dependent variables, respectively, the MLR and PCR models were built. The applicability of both models to different soil types was then evaluated. The models were evaluated using the coefficient of determination (R²) and the root mean square error (RMSE). The larger the R², the better the stability of the model; the smaller the RMSE, the higher the accuracy of the model. The mean relative error (MRE) and the 1:1 line were used as criteria for judging the estimation capability of the model. The smaller the MRE and the closer the sample points are to the 1:1 line, the stronger the estimation capability of the model. The calculation formulas for R², RMSE, and MRE are as follows:
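In their commonly used forms, these statistics can be written as below. The variant of R² shown (one minus the ratio of the residual to the total sum of squares) is an assumption, since several definitions are in use; RMSE and MRE follow their usual definitions, and the sample index i is added for clarity.

```latex
\begin{align}
R^2 &= 1 - \frac{\sum_{i=1}^{n}\left(Y_{m,i} - Y_{p,i}\right)^2}
                {\sum_{i=1}^{n}\left(Y_{m,i} - \bar{Y}_m\right)^2} \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_{p,i} - Y_{m,i}\right)^2} \\
\mathrm{MRE} &= \frac{1}{n}\sum_{i=1}^{n}
                \frac{\left|Y_{p,i} - Y_{m,i}\right|}{Y_{m,i}} \times 100\%
\end{align}
```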
where Yp is the predicted value, Ym is the measured value, Ȳm is the average value of the measured values, and n is the number of samples.
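The calibration and validation split described above can be sketched as follows. This is a minimal illustration with synthetic values; which member of each group of three goes to validation is not stated in the paper, so taking the third is an assumption, and `split_by_rank` is a hypothetical name.

```python
def split_by_rank(samples, every=3):
    """Sort by Pb content (descending) and send every `every`-th sample
    to the validation set; the rest form the calibration set.

    `samples` is a list of (sample_id, pb_mg_per_kg) pairs.
    """
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    calibration, validation = [], []
    for rank, sample in enumerate(ranked, start=1):
        if rank % every == 0:
            validation.append(sample)
        else:
            calibration.append(sample)
    return calibration, validation


# Synthetic stand-in for the 96 retained samples (illustrative values only).
samples = [(f"S{i:02d}", 10.0 + 1.5 * i) for i in range(96)]
calibration, validation = split_by_rank(samples)
print(len(calibration), len(validation))  # 64 32
```

Splitting on the ranked list keeps the two sets similar in mean and spread, which is the property the descriptive statistics later rely on.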
The calibration and validation samples are similar in mean and standard deviation, and their ranges are relatively uniform and balanced (Table 1). The total sample set has a coefficient of variation of 0.60. The distribution of Pb in the soil is not uniform and shows significant spatial heterogeneity. Seventy-nine of the 96 samples exceed the national natural background value (China National Environmental Monitoring Center 1990); that is, 82% of the samples exceed the background value, and the overall mean is 1.8 times the national background value. This result indicates that the soil in the Houzhai River Watershed is heavily polluted by the heavy metal Pb, and corresponding treatment measures must be taken. The samples comprise 27 yellow, 23 paddy, and 46 limestone soils. The overall Pb content is highest in yellow soil, followed by limestone and paddy soils. In terms of land use, Pb content is highest in grassland, followed by cropland (Fig. 3).
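The summary figures quoted above (coefficient of variation, share of samples above the background value, and the ratio of the mean to the background value) can be reproduced with a short helper. This is a sketch under assumptions: the sample standard deviation is used for the coefficient of variation, the background value is passed in as a parameter rather than hard-coded because the paper does not quote it here, and `describe` is a hypothetical name.

```python
import statistics


def describe(pb_values, background):
    """Summary statistics for Pb content relative to a background value."""
    mean = statistics.mean(pb_values)
    sd = statistics.stdev(pb_values)          # sample SD (an assumption)
    exceed = sum(1 for v in pb_values if v > background)
    return {
        "mean": mean,
        "cv": sd / mean,                       # coefficient of variation
        "exceed_fraction": exceed / len(pb_values),
        "mean_to_background": mean / background,
    }


# Illustrative values only, not the measured data.
summary = describe([10, 20, 30, 40], background=25)
print(summary)

# The exceedance share reported in the text: 79 of 96 samples.
print(round(100 * 79 / 96, 1))  # 82.3
```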
The "burr" in the spectral curve is significantly reduced after Savitzky–Golay smoothing and RE, and the quality of the spectral information is significantly improved (Fig. 2). The spectral curve of the soil shows an upward trend. Three significant absorption bands are observed at approximately 1400, 1900, and 2200 nm, and their degrees of absorption differ slightly. This result is consistent with the study of Teng et al. (2016).
The overall spectral characteristics of the soil in the study area are similar, whereas the spectral curves of different soil types differ slightly (Fig. 4). The average absorbance of different soil types follows the order paddy soil > yellow soil > limestone soil (Fig. 4d). The average reflectance of soil samples follows the order limestone soil > yellow soil > paddy soil. Studies have shown that spectral information can be used to predict heavy metal content indirectly (Song et al. 2015); thus, the screening of spectral feature bands is an important step in the establishment of a Pb content prediction model. The absorption bands of the spectral curves after CR are concentrated at approximately 530, 950, 1410, 1910, 2200, and 2260 nm (Fig. 4a). The absorption bands of the spectral curve after RFD are concentrated near 630, 810, 1390, 1440, 1900, 2160, 2250, 2310, and 2440 nm (Fig. 4b). The absorption bands of the spectral curve after RSD are concentrated near 590, 800, 1380, 1440, 1870, 2140, 2240, 2300, and 2440 nm (Fig. 4c). The absorption bands of the spectral curve after AB conversion are concentrated at 790, 1410, 1920, 2220, and 2260 nm (Fig. 4d). The absorption bands of the spectral curves after AFD and ASD conversion are the same as those of RFD and RSD, respectively.
Table 1 Descriptive statistics of soil Pb content (mg/kg)
Fig. 3 Statistics of total Pb content in the soil. a Pb content in different soil types and b Pb content in different land-use types
The water absorption bands are generally around 1400 and 1900 nm. The absorption bands of soil organic matter are at approximately 500, 600, 800, 2200, and 2300 nm. To avoid interference with modeling, the bands between 2400 and 2500 nm are not considered as feature bands because of their high noise (Wang et al. 2007; Cheng et al. 2017). Finally, combining the six transformed forms, the feature absorption bands of the soil spectrum are concentrated in the near-infrared region near 1440, 1870, 2140, 2160, 2240, 2250, and 2260 nm.
Fig. 4 Spectral curve transformation. a CR, b RFD, c RSD, d AB, e AFD, and f ASD
The Pearson method was used to analyze the correlation between the Pb content and the spectral values of the feature absorption bands of the six transformed forms. After CR, there is no significant correlation between the spectral values at the seven bands and the Pb content (Table 2). Therefore, CR should not be considered as a variable when modeling. In addition, the correlation coefficient after AB conversion is improved compared with that of the original reflectance, and the overall correlation is the strongest, indicating that the spectral data after AB transformation have the highest correlation with the Pb content. To improve the accuracy of the model and achieve better estimation results, this study selected the bands that are significantly correlated at the 0.01 level as the spectral feature bands of heavy metal Pb in soil. The feature bands of RE, RFD, AB, and AFD are at 1870, 2140, 2160, 2240, 2250, and 2260 nm. The feature bands of RSD and ASD are at 1440, 1870, 2140, 2240, 2250, and 2260 nm.
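The band screening described above can be sketched as follows. This is a minimal illustration rather than the software procedure used in the study: significance is judged by converting a two-sided critical t value (looked up from a t table by the reader) into a minimum |r|, and the function names, band labels, and data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def critical_r(n, t_crit):
    """Smallest significant |r| for n samples, given the two-sided
    critical t value for n - 2 degrees of freedom."""
    df = n - 2
    return t_crit / math.sqrt(t_crit ** 2 + df)


def screen_bands(spectra_by_band, pb, t_crit):
    """Keep the bands whose |r| with Pb content reaches the critical value."""
    r_min = critical_r(len(pb), t_crit)
    selected = {}
    for band, values in spectra_by_band.items():
        r = pearson_r(values, pb)
        if abs(r) >= r_min:
            selected[band] = r
    return selected


# Illustrative data only. For 5 samples (3 degrees of freedom) the
# two-sided 0.01 critical t is about 5.841.
pb = [1.0, 2.0, 3.0, 4.0, 5.0]
spectra = {"band_a": [2.0, 4.0, 6.0, 8.0, 10.0],
           "band_b": [1.0, -1.0, 0.0, -1.0, 1.0]}
print(screen_bands(spectra, pb, t_crit=5.841))
```

With 96 samples and a critical t of roughly 2.63 (an approximate table value for 94 degrees of freedom), `critical_r` gives a threshold of about 0.26; whether the study computed correlations on all 96 samples or on the calibration set only is not stated.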
3.3.1 Modeling results
The spectral values of the spectral feature bands and the Pb content were taken as the independent and dependent variables, respectively, to establish the models. R² exceeds 0.5 for all six variants (Table 3), indicating that the MLR model is reliable and can be used to estimate Pb content (Wang et al. 2014; Tao et al. 2018). The model with AB conversion performs best. The R² values of the calibration and validation sets are 0.649 and 0.799, respectively, and the RMSE values are 24.805 and 25.034, respectively. Although the calibration set of ASD has the largest R² and the smallest RMSE, the stability and accuracy of its validation set are far lower than those of the model built with AB. Moreover, the model built with RE is considerably more accurate than the models built with the remaining variables other than AB. The stability and accuracy of the models constructed with different spectral variables follow the order AB > RE > ASD > RSD > AFD > RFD.
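An MLR fit of this kind reduces to ordinary least squares on the feature-band values. The sketch below solves the normal equations with Gaussian elimination in plain Python; it is an illustration under assumptions, not the authors' implementation, and the function names and data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                     # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_mlr(features, target):
    """Ordinary least squares; returns [intercept, b1, ..., bk]."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_mlr(coefficients, features):
    """Apply fitted coefficients to new feature rows."""
    return [coefficients[0] + sum(b * v for b, v in zip(coefficients[1:], f))
            for f in features]


# Illustrative data generated from y = 1 + 2*x1 + 3*x2 (not measured values).
features = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
target = [1, 3, 4, 6, 8]
coefficients = fit_mlr(features, target)
print([round(c, 6) for c in coefficients])
```

Because several feature bands are adjacent (for example 2240, 2250, and 2260 nm), the predictors are likely to be strongly correlated, which makes the normal equations ill-conditioned; this is the collinearity issue that PCR is meant to address.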
Table 2 The correlation coefficients between soil heavy metal Pb content and spectral variables
Similar to the MLR model, the PCR model performs best with the spectral variables after AB transformation, and the model with RE as the variable ranks second (Table 4). For the model constructed with AB, R² is the largest and RMSE is the smallest in both the calibration and validation sets. The performance of the models constructed from different spectral variables follows the order AB > RE > ASD > RSD > RFD > AFD.
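PCR replaces the correlated band values with a small number of uncorrelated principal component scores and regresses Pb content on those scores. The sketch below is a plain-Python illustration under assumptions (covariance-based PCA without scaling, components found by power iteration with deflation); the study used dedicated software, the number of retained components is not reported, and all names and data here are hypothetical.

```python
import math


def _top_eigenvector(matrix, iterations=500):
    """Dominant eigenpair of a symmetric positive semidefinite matrix."""
    k = len(matrix)
    v = [1.0 / (j + 1) for j in range(k)]      # non-uniform starting vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(k))
                     for i in range(k))
    return eigenvalue, v


def fit_pcr(features, target, n_components):
    """Return (intercept, coefficients) expressed in the original variables."""
    n, k = len(features), len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(k)]
    centered = [[row[j] - means[j] for row in features] for j in range(k)]
    y_mean = sum(target) / n
    y_c = [t - y_mean for t in target]
    cov = [[sum(a * b for a, b in zip(centered[i], centered[j])) / (n - 1)
            for j in range(k)] for i in range(k)]
    coefficients = [0.0] * k
    for _ in range(n_components):
        eigenvalue, v = _top_eigenvector(cov)
        scores = [sum(v[j] * centered[j][i] for j in range(k)) for i in range(n)]
        ss = sum(s * s for s in scores)
        if ss == 0.0:
            raise ValueError("component has zero variance; use fewer components")
        # Scores of different components are orthogonal, so each slope can
        # be estimated independently of the others.
        gamma = sum(s * y for s, y in zip(scores, y_c)) / ss
        coefficients = [c + gamma * vj for c, vj in zip(coefficients, v)]
        # Deflate so the next pass finds the next component.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(k)]
               for i in range(k)]
    intercept = y_mean - sum(c * m for c, m in zip(coefficients, means))
    return intercept, coefficients


# Illustrative data generated from y = 1 + 2*x1 + 3*x2 (not measured values).
features = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
target = [1, 3, 4, 6, 8]
intercept_full, coef_full = fit_pcr(features, target, n_components=2)
intercept_one, coef_one = fit_pcr(features, target, n_components=1)
print(round(intercept_full, 6), [round(c, 6) for c in coef_full])
```

Keeping every component reproduces the ordinary least squares fit; dropping low-variance components trades a little bias for a more stable model when bands are collinear.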
3.3.2 Accuracy comparison
In summary, both the MLR and PCR models are most stable when constructed with AB as the independent variable. Overall, the stability and fitting performance of the MLR model are superior to those of the PCR model. A scatter plot of the measured and predicted values of the MLR and PCR models built with AB as the independent variable is created. The closer the sample points are to the 1:1 line (i.e., the Y = X line), the closer the estimation result of the model is to the true value, and the stronger the estimation capability. Moreover, MRE is used as a reference indicator. The smaller the MRE, the stronger the prediction capability of the model.
The overall MRE of both models is less than 30%, which means that the estimation accuracy is higher than 70%, and both models have a certain capability to estimate soil Pb content (Fig. 5). The prediction capability of the MLR model is stronger than that of the PCR model: its sample points are closer to the 1:1 line, and its MRE is lower. The MRE of the MLR model is 28.22%, and its prediction accuracy reaches 71.78%.
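The evaluation statistics used in this comparison can be computed as in the sketch below. The R² form (one minus the ratio of residual to total sum of squares) is an assumption, the accuracy is taken as 100% minus MRE as the text implies, and the data are illustrative rather than measured.

```python
import math


def r_squared(measured, predicted):
    """Coefficient of determination, assumed here as 1 - SS_res / SS_tot."""
    mean_m = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean_m) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


def rmse(measured, predicted):
    """Root mean square error, in the units of the measured values."""
    n = len(measured)
    return math.sqrt(sum((p - m) ** 2 for m, p in zip(measured, predicted)) / n)


def mre_percent(measured, predicted):
    """Mean relative error as a percentage of the measured values."""
    n = len(measured)
    return 100.0 * sum(abs(p - m) / m for m, p in zip(measured, predicted)) / n


# Illustrative values only.
measured = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 36.0]
mre = mre_percent(measured, predicted)
print(round(r_squared(measured, predicted), 3),
      round(rmse(measured, predicted), 3),
      round(mre, 2),
      round(100.0 - mre, 2))   # accuracy taken as 100% - MRE
```

With the reported MRE of 28.22%, the same subtraction gives the 71.78% accuracy quoted for the MLR model.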
Table 4 The PCR model of soil total Pb content
Fig. 5 Comparison of measured and predicted Pb content. a Prediction of the MLR model with AB as the independent variable and b prediction of the PCR model with AB as the independent variable
3.3.3 Analysis of the applicability of models in different soil types
The above results confirm that both models have a certain reliability, but their applicability to different soil types needs further analysis because of the noticeable soil differentiation in the karst area. The MLR and PCR models are built separately for three soil types, namely, yellow, limestone, and paddy soils, with AB, the best-performing variable, as the independent variable. The R² values of the MLR and PCR models for the three soil types are greater than 0.6, indicating that both models have a certain applicability and can be used to predict Pb content in different soil types (Fig. 6). Overall, the prediction performance of the MLR and PCR models differs little across soil types. However, the R² of the MLR model is larger than that of the PCR model, indicating that the MLR model predicts better than the PCR model for different soil types, especially for limestone soil. The R², RMSE, and MRE of the MLR model (0.747, 22.28, and 25.37%, respectively) are better than those of the PCR model. That is, the applicability of the PCR model is not as strong as that of the MLR model. However, the prediction accuracy of the PCR model improves slightly after the soil types are distinguished: R² increases from 0.581 to over 0.614, with a maximum value of 0.709 (Table 4). When the PCR model is used to predict Pb content, an improved result can therefore be achieved by building the model separately for each soil type.
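Fitting a separate model per soil type amounts to grouping the samples before regression. The sketch below is a deliberate simplification: it fits a single-predictor line per group, whereas the study fitted multi-band MLR and PCR models, and the records, names, and values are hypothetical.

```python
from collections import defaultdict


def fit_line(x, y):
    """Simple linear regression; returns (intercept, slope, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return intercept, slope, 1.0 - ss_res / ss_tot


def fit_per_soil_type(records):
    """records: iterable of (soil_type, spectral_value, pb_content)."""
    groups = defaultdict(lambda: ([], []))
    for soil_type, x, y in records:
        groups[soil_type][0].append(x)
        groups[soil_type][1].append(y)
    return {soil: fit_line(xs, ys) for soil, (xs, ys) in groups.items()}


# Illustrative records only, not measured data.
records = [
    ("yellow", 1.0, 3.0), ("yellow", 2.0, 5.0), ("yellow", 3.0, 7.0),
    ("paddy", 1.0, 2.0), ("paddy", 2.0, 2.5), ("paddy", 3.0, 4.0),
]
models = fit_per_soil_type(records)
for soil, (intercept, slope, r2) in models.items():
    print(soil, round(intercept, 3), round(slope, 3), round(r2, 3))
```

Comparing the per-group R² with that of a single pooled fit is one way to check whether distinguishing soil types helps, which is the comparison the text reports for PCR.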
Fig. 6 Applicability of MLR and PCR models based on different soil types. a The MLR model of yellow soil, b the PCR model of yellow soil, c the MLR model of limestone soil, d the PCR model of limestone soil, e the MLR model of paddy soil, and f the PCR model of paddy soil
The MLR model can prevent the omission of effective independent variables, and the PCR model can effectively solve the collinearity problem between independent variables (Xu et al. 2013; Ye et al. 2017). These models are widely used in spectroscopic prediction research because they are simple, easy to develop, and low in cost (Wang et al. 2017). In this study, the two methods are used to model heavy metal Pb content in a small watershed of a typical karst plateau, with the spectral feature bands as independent variables. Extracting the spectral information of heavy metals in soil is difficult, and the modeling accuracy is easily affected by interference from other soil constituents because heavy metals are not the dominant factor controlling the spectral signal. Therefore, removing the bands that are susceptible to interference from other information is the key to establishing a hyperspectral model (Zhang et al. 2015). First, several mathematical transformations are performed on the spectral reflectance to remove the interference of noise and make the reflection valleys and absorption peaks of the soil spectrum easier to find. Second, the bands that may carry interfering information, such as that from moisture and organic matter, are removed, and Pearson correlation analysis is used to screen the near-infrared bands that are sensitive to Pb content, which is similar to previous studies (Jiang et al. 2017; Zhang et al. 2017). Finally, a hyperspectral quantitative estimation model between spectral variables and soil heavy metal Pb is built. The R², RMSE, and MRE indicators are used to evaluate the models. The best stability and reliability are achieved when the models are built on the basis of AB; this finding is consistent with the results of the correlation analysis. Except for AB, the models built from the other transformed spectral variables are not better than the model established from the original reflectance; thus, whether spectral transformation is required should be determined in accordance with the actual situation (Wu et al. 2014).
Studies on heavy metals in the soil of karst areas are relatively few. The estimation accuracy of the models constructed in this study meets the requirements for estimating soil Pb content, thus providing some ideas for research on similar areas. Moreover, the results of this study provide a degree of technical support for addressing soil heavy metal pollution in karst areas. Soil heavy metal content shows large spatial heterogeneity owing to the complex geological background of karst regions. Future research will consider the effect of other factors on soil heavy metal content in karst areas and explore the influence of heavy metal pollution on human health through isotope tracing.
The average soil Pb content in the study area is 1.8 times the national background value, and the level of variation is high, indicating that soil Pb in the Houzhai River Watershed shows strong spatial heterogeneity. Moreover, heavy metal Pb pollution is serious, and corresponding treatment measures must be taken.
Soil Pb is spectrally sensitive in the near-infrared region, and its feature bands are concentrated near 1440, 1870, 2140, 2160, 2240, 2250, 2260, and 2350 nm. After mathematical transformation, all spectral variables except CR are significantly correlated with soil Pb, and the correlation of AB is the most significant.
The MLR and PCR models are most reliable with AB as the independent variable. The overall R² and RMSE reach their maximum and minimum, respectively, which is consistent with the results of the correlation analysis. The stability, fitting performance, and prediction capability of the MLR model are better than those of the PCR model.
The MLR and PCR models have a certain applicability to different soil types, and the applicability of the MLR model is stronger. When the PCR model is used for prediction, distinguishing different soil types can achieve improved results.
In the present study, the model constructed using hyperspectral technology can predict heavy metal Pb content in the soil, which is crucial for large-area, large-scale, and rapid monitoring of soil heavy metal content. In future research, airborne and spaceborne hyperspectral images will be used in an attempt to achieve fast, wide-coverage monitoring of heavy metals in soil.
Acknowledgements This research work was supported jointly by the National Key Research Program of China (Nos. 2016YFC0502300 and 2016YFC0502102), Chinese Academy of Science, and Technology Services Network Program (No. KFJ-STS-ZDTP-036), International Cooperation Agency International Partnership Program (Nos. 132852KYSB20170029, 2014-3), Guizhou High-level Innovative Talent Training Program "Ten" Level Talents Program (No. 2016-5648), United Fund of Karst Science Research Center (No. U1612441), International Cooperation Research Projects of the National Natural Science Fund Committee (Nos. 41571130074 and 41571130042), and Science and Technology Plan of Guizhou Province of China (No. 2017–2966).
Conflict of interest On behalf of all authors, the corresponding author states that there is no conflict of interest.