Combined multi-kernel support vector machine and wavelet analysis for hyperspectral remote sensing image classification
Many remote sensing image classifiers are limited in their ability to combine spectral features with spatial features. Multi-kernel classifiers, however, can integrate spectral features with spatial or structural features by using multiple kernels and summing them for the final output. Using a support vector machine (SVM) as the classifier, different multi-kernel classifiers are constructed and tested on a 64-band Operational Modular Imaging Spectrometer II hyperspectral image of Changping Area, Beijing City. Results show that by integrating spectral and wavelet texture information, multi-kernel SVM classifiers obtain more accurate classification results than sole-kernel SVM classifiers and cross-information SVM kernel classifiers. Moreover, when the multi-kernel SVM classifier is used, the combination of the first four principal components from principal component analysis and wavelet texture provides the highest accuracy (97.06%). Multi-kernel SVM is therefore an effective approach for improving the accuracy of hyperspectral image classification and for expanding the possibilities of remote sensing image interpretation and application.
OCIS codes: 100.4145, 100.5010, 100.7410.
doi: 10.3788/COL201109.011003.
Hyperspectral remote sensing image classification remains a challenging topic in remote sensing[1-3]. With the increasing amounts of data from airborne and satellite hyperspectral sensors, the large number of bands combined with limited training samples poses a “curse of dimensionality”[1-3]. To address this issue and to increase the stability of classifiers, several feature selection algorithms have been proposed. However, these approaches are time-consuming and scenario-dependent, and usually lead to information loss[4]. In recent years, support vector machines (SVMs) have been widely used to classify hyperspectral images, demonstrating excellent performance in terms of accuracy, generalization, and robustness[5].
According to related studies, SVMs perform better than traditional classifiers[1,6,7]. This is observed even when only spectral information is used in an SVM classifier. Moreover, SVMs can account for both spectral information and spatial features; for example, a good framework of multi-kernels for hyperspectral image classification has been presented[5,8,9]. However, those studies paid little attention to the parameter selection technique and to more sophisticated texture techniques, two methods that are very important for improving the performance of multi-kernel approaches.
Many recent studies have attempted to enhance texture analysis through different methods, such as texture classification[10], texture segmentation[11,12], and texture detection[11,13]. Various texture analysis techniques have also been developed. Traditional statistical approaches to texture analysis, such as co-occurrence matrices, second-order statistics, and Gauss-Markov random fields[14], are restricted to the analysis of spatial interactions over relatively small neighborhoods on a single scale.
Wavelet theory has developed rapidly since the 1980s and has been used in fields such as signal processing[11,15-17], image restoration[18], image retrieval[19], and pattern recognition[20], among others. Uses of wavelet theory in the texture analysis of remote sensing images have also been investigated[21].
In this letter, SVMs, which have demonstrated superior performance in the context of hyperspectral image classification, are adopted as the classifier. A series of multi-kernel classifiers based on SVM are constructed to account for both spectral and spatial features (described by wavelet texture information).
The SVM theory for the two-class problem can be found in Refs. [22,23]. Popular kernel functions include the linear kernel, polynomial kernel, Gaussian radial basis function (RBF) kernel, and sigmoid kernel.
For the common SVM kernels, the kernel mapping function measures the similarity between samples. In general, a new kernel can be constructed provided that it fulfils Mercer’s condition.
Let χ be any input space and K : χ×χ → R be a symmetric function. K is a Mercer’s kernel if, and only if, the kernel matrix formed by restricting K to any finite subset of χ is positive semi-definite, i.e., has no negative eigenvalues.
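This condition can be checked numerically. As an illustrative sketch (the function and variable names below are ours, not from the letter), the kernel matrix of a Gaussian RBF kernel on any finite sample set should show no negative eigenvalues beyond numerical tolerance:

```python
import numpy as np

def rbf_kernel(X, Y, delta=1.0):
    """Gaussian RBF kernel: K(x, y) = exp(-||x - y||^2 / (2 * delta^2))."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * delta ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))        # any finite set of 20 samples, 5 features
K = rbf_kernel(X, X)

eigvals = np.linalg.eigvalsh(K)     # symmetric matrix -> real eigenvalues
```

For the RBF kernel, the minimum eigenvalue of K is non-negative (up to floating-point error), confirming the Mercer condition for this sample set.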
We can then consider the convex combination of multiple kernels:
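K(x_i, x_j) = \sum_{k} \beta_k K_k(x_i, x_j), \qquad \beta_k \ge 0, \quad \sum_{k} \beta_k = 1,

where each K_k is itself a Mercer kernel; because the weights form a convex combination, K is again a valid Mercer kernel.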
For a given pixel entity x_i, if x_i is represented by x_i^s in the spectral domain and by x_i^t in the spatial domain after texture extraction, we can rewrite the foregoing multi-kernel function as
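K(x_i, x_j) = \beta_s K_s(x_i^s, x_j^s) + \beta_t K_t(x_i^t, x_j^t), \qquad \beta_s, \beta_t \ge 0, \quad \beta_s + \beta_t = 1.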
The multi-kernel equation constitutes a trade-off between the spectral and spatial domains, which is implemented by setting the parameters β_k. The preceding kernel classifiers can be conveniently modified to account for the relationship between spatial and spectral information[22].
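As a minimal numpy sketch of this summation kernel (the feature arrays here are synthetic placeholders, not the OMIS II data), two RBF kernels computed on spectral and texture features are combined with equal weights:

```python
import numpy as np

def rbf(X, Y, delta):
    """Gaussian RBF kernel matrix between sample sets X and Y."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * delta ** 2))

rng = np.random.default_rng(1)
Xs = rng.normal(size=(30, 4))   # spectral features (e.g., first four PCs)
Xt = rng.normal(size=(30, 4))   # spatial features (e.g., wavelet texture)

beta = 0.5                      # equal spectral/spatial trade-off, as in the experiments
K_multi = beta * rbf(Xs, Xs, 1.0) + (1.0 - beta) * rbf(Xt, Xt, 1.0)
```

Since each term is a Mercer kernel and the weights are convex, K_multi is again a valid kernel and can be supplied to any SVM implementation that accepts precomputed kernel matrices.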
Similar to the convex kernel, another kernel can be defined as[5]
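K(x_i, x_j) = K_s(x_i^s, x_j^s) + K_t(x_i^t, x_j^t) + K_{st}(x_i^s, x_j^t) + K_{ts}(x_i^t, x_j^s),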
where the spectral feature x_i^s and the spatial feature x_j^t must have the same dimension. This is named the cross-information kernel. We use a polynomial kernel for K_st(x_i^s, x_j^t) and K_ts(x_i^t, x_j^s) to decrease the computational burden and to simplify parameter selection. A detailed proof can be found in Ref. [5].
Wavelet analysis is rooted in the Fourier transform. Frequency analysis of stationary signals can be effectively achieved by projecting the signal onto a set of basis functions of infinite spatial extent using the Fourier transform:
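X(f) = \int_{-\infty}^{+\infty} X(t) \, e^{-j 2 \pi f t} \, dt,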
where X(f) represents the global frequency of the signal. Similarly, effective frequency analysis of non-stationary signals can be achieved by projecting the signal onto a set of spatially localized basis functions using the wavelet transform. This is achieved by combining the inner products of signal X(t) with the translation and dilation of a prototype function:
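W_{\psi}(a, b) = \int_{-\infty}^{+\infty} X(t) \, \psi_{a,b}^{*}(t) \, dt,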
where a, b ∈ R; a is the dilation factor and b is the translation factor. Here, ψa,b(t) is the translated and scaled version of the mother wavelet ψ(t), given by
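\psi_{a,b}(t) = \frac{1}{\sqrt{|a|}} \, \psi\left(\frac{t - b}{a}\right).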
Various choices of a and b result in different possible wavelet bases at different scales and translations. Notably, large values of the scale variable a correspond to low frequencies and to long (dilated) versions of the ψa,b(t) function, whereas small values correspond to high frequencies and to short (compressed) versions. This effectively shows that low frequencies need to be analyzed at large scales, and high frequencies at small scales. Haar, Daubechies, and Gabor are just a few of the available wavelet functions.
According to Mallat’s theory, image data can be decomposed into orthogonal components using the discrete wavelet transform (DWT) method[5].
A DWT example for image data is shown in Fig. 1. To extend the wavelet transform to two dimensions, it is only necessary to filter separately along the horizontal and vertical directions. This produces four sub-bands at each scale. Denoting the horizontal frequency first and the vertical frequency second, this produces the high-high (HH), high-low (HL), low-high (LH), and low-low (LL) image sub-bands.
A typical way of characterizing a texture using the DWT is to extract energy measures from each sub-band[24]. If a sub-band is x(m, n), its energy feature can be calculated using the l1-norm:
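E = \frac{1}{M N} \sum_{i=1}^{M} \sum_{j=1}^{N} \left| x(i, j) \right|,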
where M and N are the length and width of the sub-band, respectively; i and j are the rows and columns of the channel, respectively; and x is the wavelet coefficient. Each energy measure characterizes the magnitude of frequency content at the orientation and scale of each sub-band. Thereafter, each sub-band should be resampled to the original length and width of the image data.
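The sub-band decomposition and l1-norm energy can be sketched in numpy. For brevity this sketch uses an unnormalized Haar-style averaging/differencing step rather than the Daubechies filters the letter adopts, and a random toy image instead of real data:

```python
import numpy as np

def haar_dwt2(img):
    """One-level 2-D Haar-style DWT: returns the LL, LH, HL, HH sub-bands."""
    # filter along the horizontal direction (pairs of columns)
    lo = (img[:, 0::2] + img[:, 1::2]) / 2.0
    hi = (img[:, 0::2] - img[:, 1::2]) / 2.0
    # then along the vertical direction (pairs of rows)
    LL = (lo[0::2, :] + lo[1::2, :]) / 2.0
    LH = (lo[0::2, :] - lo[1::2, :]) / 2.0
    HL = (hi[0::2, :] + hi[1::2, :]) / 2.0
    HH = (hi[0::2, :] - hi[1::2, :]) / 2.0
    return LL, LH, HL, HH

def l1_energy(band):
    """l1-norm energy of a sub-band, averaged over its M x N coefficients."""
    M, N = band.shape
    return np.abs(band).sum() / (M * N)

rng = np.random.default_rng(2)
img = rng.normal(size=(8, 8))               # toy stand-in for one image channel
bands = haar_dwt2(img)
energies = [l1_energy(b) for b in bands]    # one energy feature per sub-band
```

Each sub-band has half the original length and width, which is why the letter resamples the sub-bands back to the original image size before classification.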
By comparing the Haar wavelet and the Daubechies’ wavelet, we see that the latter has continuous derivatives responding well to discontinuities in the texture, while the former does not allow sharp transitions and fast attenuation. Moreover, the Haar wavelet cannot efficiently separate the image signals into low- and high-frequency sub-bands. In contrast, Daubechies’ wavelet constructs smooth scaling functions of compact support with orthonormal shifts. The DWT method can then be used to obtain smooth orthogonal wavelets. Many examples of textured image analysis, comparison, and segmentation have shown that Daubechies’ wavelet is a successful technique[25,26]. Daubechies’ wavelet transform (WT) based texture analysis produces the best result in invariant texture classification[27,28] .
For texture extraction, principal component analysis (PCA) was first conducted. The first four principal components (PCs) were chosen for wavelet texture extraction because they contain most of the information (about 90%) in the original data. The window size of the wavelet texture was also significant; we tested 3×3, 8×8, and 16×16 window sizes and finally adopted the 8×8 window. Figure 2 shows the original remote sensing image and the two-dimensional (2D) wavelet texture images, which comprise seven different texture images as seven vectors. To obtain the wavelet texture, one-dimensional (1D) DWT and 2D DWT were applied. As the extension of 1D DWT, 2D DWT can be carried out as the tensor product of two 1D wavelet basis functions in the horizontal and vertical directions, producing the HH, HL, LH, and LL image sub-bands. Using the above wavelet theory, we obtained four-dimensional feature data from both the 1D DWT and the 2D DWT; these features were named wavelet texture 1 (WT1) and wavelet texture 2 (WT2), respectively.
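The PC extraction step can be sketched with a plain eigendecomposition of the band covariance. The cube below is random toy data (real hyperspectral bands are strongly correlated, which is why the first four PCs capture ~90% of the variance in the letter; random data will not show that):

```python
import numpy as np

def first_pcs(cube, n_pcs=4):
    """PCA on a (rows, cols, bands) cube; keep the leading principal components."""
    r, c, b = cube.shape
    X = cube.reshape(-1, b).astype(float)
    X -= X.mean(axis=0)                       # center each band
    cov = np.cov(X, rowvar=False)             # (bands, bands) covariance
    vals, vecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(vals)[::-1]            # sort descending by variance
    vals, vecs = vals[order], vecs[:, order]
    pcs = X @ vecs[:, :n_pcs]                 # project onto leading components
    explained = vals[:n_pcs].sum() / vals.sum()
    return pcs.reshape(r, c, n_pcs), explained

rng = np.random.default_rng(3)
cube = rng.normal(size=(16, 16, 48))          # toy stand-in for the usable bands
pcs, ratio = first_pcs(cube)
```

The `explained` ratio is the fraction of total variance retained by the chosen components, which is the criterion the letter uses to justify keeping four PCs.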
The data used in the experiment were the Operational Modular Imaging Spectrometer II (OMIS II) hyperspectral image of Changping Area, Beijing City, China. The image has 512 rows, 512 columns, and 64 bands ranging from 0.46 to 1.1 μm. It has a spatial resolution of 3 m. Preprocessing was conducted by the data provider. Figure 2(a) is the RGB composite image of the hyperspectral image (R: Band 36 with a wavelength of 0.81 μm; G: Band 23 with a wavelength of 0.68 μm; and B: Band 11 with a wavelength of 0.56 μm). In Fig. 2(a), the black region represents the fish pond area, the yellow (light gray in Print Edition) region represents the yellow grass area, and the white region represents the inhabited area. Sixteen bands of noisy data were removed.
We chose spectral data, PCA data, and wavelet texture for multi-kernel SVM classification. After PCA was applied, the first four PCs were chosen[3]. In the following experiments, sole-kernel, multi-kernel, and cross-information kernel classifiers were constructed. Figure 3 shows the flow chart of the algorithm. For simplicity, 0.5 was assigned to β_k for both features. As the penalization and RBF parameters were difficult to choose, we used an improved method in all cases: first, we set the penalization range C = {10^-1, 1, 10, ..., 10^3}; then we searched each RBF parameter δ in the range δ = [10^-1, ..., 10^2]. Although time-consuming, the method worked well. We used one-against-one multi-classification for the SVM.
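The parameter search amounts to scoring each grid point on held-out samples. The sketch below uses toy two-class data and, to stay dependency-free, a simple kernel nearest-class-mean rule standing in for the SVM (for a real SVM, each (C, δ) pair from the ranges above would be scored the same way):

```python
import numpy as np

def rbf(X, Y, delta):
    """Gaussian RBF kernel matrix between sample sets X and Y."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * delta ** 2))

# toy two-class data standing in for the training pixels
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 1.0, (40, 3)), rng.normal(3.0, 1.0, (40, 3))])
y = np.array([0] * 40 + [1] * 40)
idx = rng.permutation(80)
tr, te = idx[:60], idx[60:]                 # hold-out split

def holdout_score(delta):
    # stand-in classifier: assign each test sample to the class with the
    # larger mean kernel similarity to its training samples
    K = rbf(X[te], X[tr], delta)
    m0 = K[:, y[tr] == 0].mean(axis=1)
    m1 = K[:, y[tr] == 1].mean(axis=1)
    return ((m1 > m0).astype(int) == y[te]).mean()

delta_grid = [10.0 ** k for k in (-1, 0, 1, 2)]   # delta in [10^-1, ..., 10^2]
best_delta = max(delta_grid, key=holdout_score)
```

Exhaustively scoring every grid point is what makes the search time-consuming, but it avoids hand-tuning the penalization and RBF parameters.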
After normalizing all the data, training and test samples were selected. Taking into account all spectral and texture features, the pixel purity index (PPI) of the data was computed. One thousand pure pixels were obtained and chosen as samples. In the experiments, these pure pixels, together with our ground truth, were taken as training and test samples.
The classification problem involved the identification of six land cover types for the OMIS II data set: C1: inhabited area (89 training pixels), C2: crop land (92 training pixels), C3: plant (88 training pixels), C4: road (84 training pixels), C5: vegetable (74 training pixels), and C6: water (101 training pixels). The training set was 0.201% of the original dataset. In addition, the test samples, which were obtained in the same manner, were about 0.168% of the dataset: C1: inhabited area (71 pixels), C2: crop land (69 pixels), C3: plant (76 pixels), C4: road (72 pixels), C5: vegetable (73 pixels), and C6: water (79 pixels).
For the sole-kernel SVM classifier, spectral features were used first. In all experiments, we used the RBF kernel for both spectral and texture features because of its superiority over other kernel functions[28]. Then, different features were stacked into one dataset: the spectral dimensions, the first four PCs of PCA, WT1, and WT2. Table 1 shows the classification accuracy and Kappa coefficients[3,8].
With multi-kernel classification, different combinations are possible, such as spectral data with wavelet data, or the four PCs with wavelet data. In each case, the RBF kernel was used for every dataset. Table 2 shows the classification accuracy and Kappa coefficients.
Finally, the cross-information kernel was applied. Because the different data sources must have the same dimensions, the four PCs and the wavelet textures were adopted. Table 3 shows the classification accuracy and Kappa coefficients. Figure 4 shows the most accurate classification results obtained with the different SVM kernels in each experiment.
From Tables 1–3, all multi-kernel SVMs obtained better accuracy than the sole-kernel SVM and the cross-information kernel. The best result was produced by the summation kernels, which used the first four PCs of PCA and WT1 for classification, yielding a total accuracy of 97.06% and a Kappa coefficient of 0.9472. We also compared the computing times of the different kernel functions, running each algorithm 10 times and taking the average as its final training time. The training time of the proposed algorithm was 196.9 s. When the cross-information kernel was used, the combination of the four PCs and WT1 also obtained the best results in its group, but the training time was 440.3 s. For the sole-kernel SVM, the best classification accuracy was 96.36%, obtained when spectral data and WT1 data were applied, with a training time of 164.5 s. In short, good classification results are ensured if appropriate multi-kernels are selected. Specifically, summation kernels reach the best accuracy while consuming shorter training times.
In conclusion, SVM classifiers using different kernels are applied to hyperspectral data to test their applicability. Extracted features are shown to improve SVM classification, thus presenting an alternative to what Hughes termed “the curse of dimensionality”. In these cases, multi-kernel approaches show results superior to those achieved by sole-kernel approaches. More sophisticated cross-kernel approaches are also applied in our experiments with good results. Note that the parameters are selected using a search algorithm. Results from these experiments show that wavelet texture classification is a recommended approach for use with hyperspectral data. Future work should consider multi-temporal data and kernel-based theory.
This letter was completed with the support of the Center for International Earth Science Information Network (CIESIN), Columbia University, USA. This work was supported by the National Natural Science Foundation of China (Nos. 40401038 and 40871195), the National High-Tech Program of China (No. 2007AA12Z162), Jiangsu Provincial Innovative Planning (No. CX08B 112Z), and the Fundamental Research Funds for the Central Universities (2010QNA18).