(1. School of Computer Science and Technology, Fudan University, Shanghai 201203, China; 2. Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai 200433, China)
Abstract: Acoustic analysis has great potential for clinical application because of its objective, non-invasive and low-cost nature. Auscultation is an important part of Traditional Chinese Medicine (TCM). By analyzing a voice signal, we attempt to diagnose the subject's syndrome by labelling it normal or deficient. In this paper, we explore a Data Augmentation based Convolutional Neural Network (DACNN) for auscultation. The idea behind this method is to apply a Convolutional Neural Network (CNN) to imbalanced data, with data augmentation, for automatic feature extraction and classification. We conduct experiments on our auscultation dataset containing voice segments of 959 speakers (346 males and 613 females), labeled by two experienced TCM physicians. We demonstrate the effectiveness of data augmentation in overcoming the imbalanced-dataset problem, and we compare the method's performance with traditional machine learning methods. Using DACNN, we achieve 97.25% diagnosis accuracy for females and 95.12% for males, a 1% to 10% improvement in accuracy over traditional machine learning methods, with slight improvements in the other indicators. The experimental results demonstrate that the proposed approach is helpful for objective auscultation diagnosis.
Keywords: acoustic analysis; traditional Chinese medicine; auscultation; convolutional neural network; machine learning method; data augmentation
Auscultation is the use of the auditory sense to differentiate a patient's syndromes or to classify disease [1]. In TCM theory, pathology or disease occurs when the Yin-Yang balance of the human body is disturbed, and the resulting irregular vibrations of the body become apparent in outputs such as speech sound. In traditional practice, the accuracy of auscultation depends largely on the physician's professional level; it is therefore often considered qualitative, subjective and somewhat unreliable for lack of objective quantification, especially when compared with western medicine.
In recent years, the development of objective TCM diagnosis studies has alleviated the issues mentioned above. Specifically, objective auscultation is usually achieved by acoustic analysis. Most existing studies focus on extracting different acoustic features by signal processing methods; however, no specific feature corresponds exactly to the process of acoustic diagnosis. For example, according to TCM theory, the voice largely depends on Zang Qi: a normal person with sufficient Zang Qi usually sounds sonorous and steady, while a deficient person lacking Zang Qi usually sounds timid and weak. These characteristics may correspond to acoustic features such as energy, shimmer, Linear Predictive Cepstral Coefficients (LPCC), etc. Deep learning can perform automatic feature extraction and thus reduce the risk of omitting relevant features.
Deep learning has been used successfully for classification tasks in many domains, such as computer vision and speech recognition, but few studies have applied it to auscultation diagnosis. In this paper, we propose DACNN, which uses data augmentation (noise adding) to overcome the imbalanced-dataset problem and then applies a CNN for automatic feature extraction and classification. We expect the convolutional layers to learn high-level features automatically and the fully connected layers to differentiate a patient's syndrome into normal or deficient.
Most related works use traditional machine learning methods with extracted signal features. Chiu et al. [2] extract four acoustic parameters (temporal parameters: zero-crossing rate and variations on peaks and valleys; spectral parameters: variations on peaks and valleys and spectral energy ratios) and classify syndromes into non-vacuity, moderate qi-vacuity and severe qi-vacuity through logistic regression. In later work [3], they utilize a non-linear method (fractal dimension parameters) for auscultation, which proved slightly better than their previous approach. Yan et al. [4] use a Support Vector Machine (SVM) to differentiate syndromes into healthy, Qi-vacuity and Yin-vacuity with wavelet packet transform and approximate entropy. Later, Yan et al. [5] focus on the non-stationarity of the vocal signal, using non-linear cross-prediction to extract features. They further show that auscultation features based on the fractal dimension combined with wavelet packet transform are conducive to differentiating healthy, lung Qi-deficiency and lung Yin-deficiency subjects [6].
In general, feature extraction by signal processing techniques is the first step of traditional machine learning methods. On one hand, features that correspond to TCM diagnosis principles, such as zero-crossing rate, energy, jitter and shimmer [2-3], are commonly selected. On the other hand, features frequently used in related areas (e.g. speech recognition, singer identification) are also considered, such as Mel-Frequency Cepstral Coefficients (MFCC) [7], LPCC [8] and Line Spectral Pairs (LSP) [9].
Among machine learning methods, the most commonly used is the SVM [10], which finds the optimal hyperplane separating the two classes while maximizing the margin between them [4-6]. Gaussian Mixture Models (GMM) [11], boosting [12], random forests [13] and Auto-Associative Neural Networks (AANN) [14] are also common choices in related tasks.
In this paper, we propose DACNN for auscultation, differentiating a patient's syndrome into normal or deficient. An overview of the proposed DACNN method and the traditional models is shown in Fig.1.
Fig.1 Overview of the proposed DACNN method and traditional methods
The two syndrome classes in our dataset are imbalanced in number; this issue is described in detail in Section 3.1. There are two solutions:
(1) Weighting imbalanced data
This method assigns more weight to normal instances than to deficient instances. For example, the number of deficient male instances is six times that of normal male instances, so we give normal instances proportionally more weight during classification.
(2) Data augmentation
This method uses data augmentation techniques to generate new 'data' with small perturbations. Techniques commonly used for audio include time shifting, pitch shifting, time stretching and noise adding. Because we use pitch-related features, and time shifting may distort important patterns, we chose to add random Gaussian noise. Note that we constrain the amplitude of the noise so that it mimics environmental noise without significantly affecting the original audio; a minimal sketch of this step follows the list.
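The following is a minimal sketch of the noise-adding augmentation, assuming recordings are loaded as floating-point arrays. The 30 dB signal-to-noise target is an illustrative choice, not a value reported in the paper.

```python
import numpy as np

def add_gaussian_noise(signal: np.ndarray, snr_db: float = 30.0) -> np.ndarray:
    """Return a copy of `signal` with low-amplitude Gaussian noise added.

    The noise power is scaled relative to the signal power, so the
    perturbation mimics mild environmental noise without masking the
    original audio. The 30 dB SNR default is an assumed value.
    """
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# Example: create several noisy copies of a minority-class recording,
# where `recording` is a hypothetical 1-D array sampled at 50 kHz.
# augmented = [add_gaussian_noise(recording) for _ in range(5)]
```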
We use the Short-Time Fourier Transform (STFT) to transform voice signals from the time domain into the spectral domain. In this pre-processing step, each recording is split into 10 ms segments (Hamming windowed) with 50% overlap. The resulting spectrogram is then reshaped to 513×250 points, removing regions with almost no information. An example input feature map is shown in Fig.2, and a sketch of this transformation follows the figure. We use a CNN to automatically extract, from these input feature maps, high-level features that differentiate normal and deficient syndromes.
Fig.2 The input feature map of a voice segment
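The following is a minimal sketch of this pre-processing step using librosa. The FFT length of 1024 (which yields the 513 frequency bins of the input map) and the log-magnitude scaling are our assumptions; the paper specifies only the 10 ms Hamming window, the 50% overlap, and the 513×250 map size.

```python
import librosa
import numpy as np

def voice_to_feature_map(path: str, sr: int = 50000,
                         n_frames: int = 250) -> np.ndarray:
    """Load a recording and return a 513 x 250 spectrogram feature map."""
    y, _ = librosa.load(path, sr=sr)
    spec = np.abs(librosa.stft(y,
                               n_fft=1024,       # 513 frequency bins (assumed FFT size)
                               win_length=500,   # 10 ms at 50 kHz
                               hop_length=250,   # 50% overlap
                               window="hamming"))
    spec = librosa.amplitude_to_db(spec, ref=np.max)  # log scaling (assumed)
    # Pad or crop along time so every map has exactly n_frames columns.
    if spec.shape[1] < n_frames:
        pad = n_frames - spec.shape[1]
        spec = np.pad(spec, ((0, 0), (0, pad)), constant_values=spec.min())
    return spec[:, :n_frames]
```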
In the DACNN architecture, three stacked convolutional layers transform the input feature maps into a high-level feature representation. A set of 16 kernels (5×5) convolves the input feature maps with stride one. Max-pooling with 4×4 filters then down-samples the feature maps to reduce their dimensionality, and a Rectified Linear Unit (ReLU) activation introduces the non-linearity needed for classification. The second and third convolutional layers are almost identical to the first, except that they use 32 filters. Three fully connected layers follow, ending with a softmax layer for classification. We also combine cross entropy with L2 regularization to prevent over-fitting. The architecture of DACNN is shown in Fig.3, and a PyTorch sketch follows the figure.
Fig.3 The architecture of the proposed DACNN model
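The following is a minimal PyTorch sketch of this architecture. The fully connected layer widths (128 and 64) are illustrative assumptions, as the paper does not report them; softmax is folded into nn.CrossEntropyLoss during training, and L2 regularization is applied through the optimizer's weight decay (see Section 3.3).

```python
import torch.nn as nn

class DACNN(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()

        def block(in_ch: int, out_ch: int) -> nn.Sequential:
            # 5x5 convolution (stride 1), 4x4 max-pooling, ReLU activation
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=1, padding=2),
                nn.MaxPool2d(kernel_size=4),
                nn.ReLU())

        self.features = nn.Sequential(
            block(1, 16),    # first stack: 16 kernels
            block(16, 32),   # second stack: 32 kernels
            block(32, 32))   # third stack: 32 kernels
        # For a 1 x 513 x 250 input, the final feature maps are 32 x 8 x 3.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 8 * 3, 128), nn.ReLU(),  # widths 128/64 assumed
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_classes))  # logits; softmax is inside the loss

    def forward(self, x):
        return self.classifier(self.features(x))
```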
All data in our dataset was collected and labeled by a TCM institution in China. Each recording contains a vowel /a/ vocalization at normal pitch lasting about 1-3 s. The recordings are sampled at 50 kHz with 16-bit resolution. Each voice recording was labeled normal or deficient by two experienced TCM doctors, and we removed all recordings with inconsistent labels.
Tab.1 Detailed information of our dataset
Considering the different acoustic characteristics of the two genders, we split the data into a female dataset and a male dataset. In total, we obtained 959 voice recordings, from 346 males and 613 females. More detailed information about the dataset is listed in Tab.1.
Most modern people are in sub-optimal health, and, as expected, the number of deficient subjects in our dataset is much higher than the number of normal subjects. The imbalanced ratio between the normal and deficient groups, however, is an important issue we need to address.
Considering the different acoustic characteristics of the two genders, we run the experiments on female and male samples separately. We use 10-fold cross-validation and measure performance with accuracy, precision, recall and F1 value, calculated as follows:
$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

$$\text{precision} = \frac{TP}{TP + FP} \tag{2}$$

$$\text{recall} = \frac{TP}{TP + FN} \tag{3}$$

$$F1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{4}$$
where $TP$ is the number of true positives, $TN$ the number of true negatives, $FP$ the number of false positives and $FN$ the number of false negatives.
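For reference, a brief sketch of computing these four indicators with scikit-learn, assuming `y_true` and `y_pred` are label arrays from one cross-validation fold:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def evaluate(y_true, y_pred):
    # Each function implements the corresponding formula (1)-(4) above.
    return {"accuracy":  accuracy_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred),
            "recall":    recall_score(y_true, y_pred),
            "f1":        f1_score(y_true, y_pred)}
```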
The dataset is divided into a training set and a testing set containing 70% and 30% of the samples, respectively.
We use the PyTorch framework to build and train DACNN on an NVIDIA GTX 1070 GPU. The dataset is split into mini-batches of size 50. We use the Adam optimizer with a learning rate of 0.0005; the decay weight of the L2 regularization is set to 0.0001, and the maximum number of training epochs is 100. A minimal training-loop sketch follows.
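The sketch below matches the reported settings (Adam, learning rate 0.0005, L2 weight decay 0.0001, mini-batches of 50, at most 100 epochs); `train_set` is a hypothetical dataset of (spectrogram, label) pairs.

```python
import torch
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DACNN().to(device)
criterion = torch.nn.CrossEntropyLoss()          # cross entropy (softmax inside)
optimizer = torch.optim.Adam(model.parameters(),
                             lr=0.0005,          # reported learning rate
                             weight_decay=0.0001)  # reported L2 decay weight

# train_set is a hypothetical torch Dataset of (spectrogram, label) pairs
train_loader = DataLoader(train_set, batch_size=50, shuffle=True)

for epoch in range(100):                         # at most 100 epochs
    for x, y in train_loader:                    # mini-batches of size 50
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```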
3.4.1 Comparison of two data balancing methods
Consistent with previous studies, commonly used features such as zero-crossing rate, energy, jitter, shimmer, MFCC, LPCC and LSP are extracted and combined into an 89-dimensional feature vector for each voice recording. Principal Components Analysis (PCA) is then used to remove irrelevant features and avoid over-fitting. Since most previous studies use SVM as the classifier, we also choose SVM (with no parameter tuning) as the baseline and compare the two data balancing methods (weighting imbalanced data and data augmentation) against this naïve baseline model; a sketch of the pipeline follows.
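A sketch of the baseline pipeline: hand-crafted features reduced by PCA and classified with an untuned SVM. The `extract_features` step standing in for the 89-dimensional feature extraction, and the 95% retained-variance threshold for PCA, are hypothetical details not specified in the paper.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

baseline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),  # keep components explaining 95% variance (assumed)
    SVC())                   # default parameters, i.e. no tuning

# X: (n_samples, 89) matrix of extracted features; y: normal/deficient labels
# baseline.fit(X_train, y_train)
# baseline.score(X_test, y_test)
```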
Tab.2 Detailed information of augmented dataset
For the data augmentation method, we "create" additional training data as described in Section 2.1. Note that the testing set contains no augmented data, so the proposed method is evaluated fairly; a sketch of this split appears below. Details of the augmented dataset are given in Tab.2.
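The following sketch shows how augmentation is confined to the training split, so the testing set contains no synthetic recordings. The names `train_signals`, `train_labels`, the minority label value, and the number of copies are hypothetical; `add_gaussian_noise` is the sketch from Section 2.1.

```python
def augment_training_set(train_signals, train_labels, minority_label,
                         copies: int = 5):
    """Add noisy copies of minority-class recordings to the training data only."""
    aug_x, aug_y = list(train_signals), list(train_labels)
    for sig, lab in zip(train_signals, train_labels):
        if lab == minority_label:
            # add_gaussian_noise is defined in the Section 2.1 sketch
            aug_x.extend(add_gaussian_noise(sig) for _ in range(copies))
            aug_y.extend([lab] * copies)
    return aug_x, aug_y
```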
The results of the baseline model and the two data balancing methods are shown in Tab.3.
Tab.3 The comparison of data balancing methods
From the results we find that, with data balancing, almost all indicators improve, which shows that both balancing methods work. The data augmentation method performs better, especially in accuracy; hence we use it in the subsequent experiments.
3.4.2 Comparison of proposed DACNN with traditional machine learning methods
Our baseline models are commonly used machine learning methods with extracted features. First, we extract the same features as in Section 3.4.1. For comparison, we choose several commonly used traditional machine learning algorithms, namely SVM, GMM, AdaBoost, random forest and AANN, each with optimal parameter settings. The comparison of the different classifiers is shown in Tab.4.
Tab.4 The comparison of different classifiers
From the table we find that, using the CNN, we achieve 97.25% diagnosis accuracy for females and 95.12% for males, compared with 95.15% for females and 94.29% for males for the best-performing traditional methods. Among the traditional machine learning methods, SVM does not achieve the highest accuracy (92.79% for females and 92.31% for males), but compared with the other methods it shows good F1 values (0.9616 for females and 0.9403 for males), indicating strong generalization ability. AdaBoost and random forest achieve high accuracy (95.15% and 93.60% for females, and 94.29% and 93.57% for males, respectively); both handle samples with high-dimensional features well. Owing to its symmetric network topology, AANN also performs well (95.13% for females and 90.01% for males).
The proposed DACNN method outperforms the traditional machine learning methods on both the male and female datasets. Its F1 value is high (0.9700 for females and 0.9504 for males), indicating strong generalization ability. DACNN learns to recognize high-level features through its convolutional layers and thereby becomes adept at differentiating syndromes. Taken together, our results demonstrate the effectiveness of the proposed DACNN method.
In this paper, we proposed the DACNN method for differentiating a patient's syndrome into normal or deficient. We performed experiments on female and male samples separately using our newly constructed dataset. We first compared two data balancing methods (data augmentation and weighting imbalanced data) and showed that data augmentation performs better for the same classifier. We then compared the proposed method with several traditional machine learning methods (SVM, GMM, AdaBoost, random forest and AANN). The results show that DACNN achieves 97.25% diagnosis accuracy for females and 95.12% for males, a 1% to 10% improvement in accuracy with slight improvements in the other indicators. We demonstrated that, with its high-level feature representation ability, the proposed DACNN method is helpful for objective auscultation diagnosis.
In the future, we plan to expand the dataset with high-quality labels. Second, because recordings with inconsistent labels were removed, the remaining data is more easily distinguishable; it is therefore challenging and meaningful to explore audio with controversial labels. We will also explore modeling both local features and temporal dependencies for auscultation. Furthermore, we will try to differentiate syndromes into normal, Qi-deficient and Yin-deficient.