Deep Learning Multimodal for Unstructured and Semi-Structured Textual Documents Classification

2021-12-14 09:58
Computers, Materials & Continua, 2021, Issue 7

Nany Katamesh, Osama Abu-Elnasr and Samir Elmougy

Faculty of Computers and Information, Department of Computer Science, Mansoura University, 35516, Egypt

Abstract: Due to the availability of a huge number of electronic text documents from a variety of sources representing unstructured and semi-structured information, document classification has become an area of interest for managing data. This paper presents a document classification multimodal for categorizing textual semi-structured and unstructured documents. The multimodal implements several individual deep learning models: Deep Neural Networks (DNN), Recurrent Convolutional Neural Networks (RCNN) and Bidirectional LSTM (Bi-LSTM). A stacked-ensemble meta-model technique combines the results of the individual classifiers to produce better results than those reached by any of these models individually. A series of textual preprocessing steps is executed to normalize the input corpus, followed by text vectorization, which uses Term Frequency-Inverse Document Frequency (TFIDF) or Continuous Bag of Words (CBOW) to convert text data into a numeric form that deep learning models can process. The proposed model is validated using a dataset collected from several sources with a large number of documents in every class, and the experimental results show that it performs effectively. For PDF document classification, the proposed model achieves accuracy up to 0.9045 and 0.959 for the TFIDF and CBOW features, respectively. For JSON document classification, it achieves accuracy up to 0.914 and 0.956 for the TFIDF and CBOW features, respectively. For XML document classification, it achieves accuracy up to 0.92 and 0.959 for the TFIDF and CBOW features, respectively.

Keywords: Document classification; deep learning; text vectorization; convolutional neural network; bi-directional neural network; stacked ensemble

1 Introduction

Due to the wide variety of document types circulating over the internet and used in a large range of applications, identifying the type of a document is a critical task for classification models, since it simplifies further operations. Textual semi-structured and unstructured documents differ in many ways related to their nature, including the structure of the textual representation, the degree of ambiguity, the degree of redundancy, the use of punctuation symbols and the use of idioms and metaphors [1]. Therefore, intensive preprocessing steps are required to obtain acceptable classification results when using textual representation techniques.

In addition, document classification is a process of effectively managing large volumes of documents by assigning each document to a specific class from a set of predefined classes. Formally, let $D = \{d_1, d_2, \ldots, d_n\}$ be the set of all $n$ documents and $C = \{c_1, c_2, \ldots, c_m\}$ be the set of $m$ predefined classes [2]. The document classification task can then be modeled as a function $f: D \to C$ that assigns a document $d_i$ to a specific class $c_j$. Furthermore, it brings together various fields, including Natural Language Processing (NLP), machine learning and information retrieval, to classify textual resources [3].

Moreover, machine learning algorithms, such as the Deep Neural Network (DNN) [4,5], Recurrent Neural Network (RNN) [4,5], Convolutional Neural Network (CNN) [4,5], Recurrent CNN (RCNN) [6,7], Long Short-Term Memory (LSTM) model [4,8] and Bidirectional LSTM (Bi-LSTM) [9,10], are used to train document classification models based on the word embedding feature vectors extracted from the textual documents. Besides, Term Frequency-Inverse Document Frequency (TF-IDF) [11–15] and Continuous Bag-of-Words (CBOW) [16–19] are popular text vectorization techniques that generate hand-crafted feature vectors.

The main issue with the classification of text documents relates to the great diversity in the nature of documents, which require special kinds of manipulation. Although there has been an increasing body of work using DL approaches to handle this issue, most of these approaches are designed for a certain type of data, while others ignore the relationships between data that affect the expressive power of the extracted features. Thus, there is a need for a generic approach to textual document classification across a wide range of data types with a variety of complex structures.

Therefore, this paper aims to develop an automatic document classification model for categorizing semi-structured and unstructured textual resources using Deep Learning (DL) techniques based on various text vectorization techniques. Tokenization and various text normalization techniques are used at the preprocessing level. TF-IDF and CBOW are used at the feature level. DNN, RCNN and Bi-LSTM are used at the classification level.

The remainder of this paper is organized as follows: Section 2 summarizes the related literature. Section 3 discusses the proposed approach in detail. Section 4 presents the experimental results. Finally, Section 5 gives the conclusions.

2 Literature Review

2.1 Document Classification Approaches

Document classification has two main approaches: manual and automatic classification. The first approach is both expensive and time consuming. However, it gives the user great control over the process: the user identifies the relationships between documents and handles the classification issues. The second approach, on the other hand, yields faster and more objective classification. It applies content-based matching of one or more predefined categories to documents. In addition, automatic document classification can be accomplished using one of the following three classification models: supervised, unsupervised and rule-based classification.

First, in supervised learning classification, the model is trained on a small training set of predefined input-output sample documents. The aim is to generalize the categorization task and deduce classification rules that precisely classify newly arriving documents.

Second, in unsupervised learning classification, patterns are discovered and documents are categorized based on similar words and phrases. The most similar documents are the ones that have the most attributes in common.

Third, in rule-based classification, a set of linguistic rules that define the relationships between the input dataset and its associated categories is formulated and parsed. It is most suitable for predicting data containing a mixture of numerical and qualitative features. Moreover, it is very accurate for small document sets, where the classification results are always based on the predefined rules. However, defining rules can be tedious for large document sets with many categories.

2.2 Related Work

This sub-section highlights previous studies in research areas related to the classification process, including feature representation and vectorization, and individual and multimodal classification.

2.2.1 Feature Representation and Vectorization

Huang et al. [20] have presented a statistical feature representation method that extracts the most descriptive terms in a document. It assesses the importance of a word by counting the number of times it occurs in each document and assigning it to the feature space. This method ignores the semantic values of the words and the word relationships in each sentence. Therefore, it leads to poor similarity results.

In addition, Melamud et al. [21] have presented the context2vec neural architecture, which is based on word2vec's CBOW architecture with a major enhancement: a bidirectional LSTM replaces its native context modeling. This model is an unsupervised approach that learns a generic embedding function for variable-length contexts from large corpora and produces high-quality word representations.

Yang et al. [22] have also improved feature representation by capturing the semantic and syntactic relations among words and providing rich dictionary resources that can cover all aspects of NLP tasks. This model generates both definitions and example sentences of target words. The experimental results show that the model achieves high performance on both definition modeling and usage modeling tasks. Nevertheless, it still needs further enhancement to generate more meaningful example sentences.

2.2.2 Individual Deep Learning Classifiers

Yao et al. [23] have proposed a Graph Convolutional Network (GCN) method for text classification. It achieves strong classification performance with a small proportion of labeled documents and produces interpretable word and document node embeddings. This model consists of a knowledge graph in which each node refers to an object category, and the input is represented as word embeddings of nodes for predicting the class. It also uses a single GCN layer with a larger neighborhood, which includes both one-hop and multi-hop nodes in the graph, to overcome over-smoothing. However, this method is weak at learning representations from large-scale unlabeled text data.

Moreover, Naqvi et al. [24] have developed a Roman Urdu news headline classifier based on different individual machine learning techniques, namely Logistic Regression (LR), Multinomial Naïve Bayes (MNB), Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN), to classify news into relevant categories on which further analysis and modeling can be done. First, the news dataset is collected using scraping tools. Then, a phonetic algorithm is used to control lexical variation and to test news from different websites. The experimental results show that the MNB classifier achieved the best accuracy among the mentioned classifiers.

Yoon [25] has proposed a convolutional neural network model for sentence classification. This model uses a single convolution layer after extracting word embeddings for the tokens in the input sequence. It achieves acceptable results on multiple benchmarks using several variants of hyperparameter tuning and static vectors, compared to other DL models that utilize complex pooling schemes.

Furthermore, Zhang et al. [26] have implemented character-level convolutional networks (ConvNets) for text classification. This model encodes characters using a one-hot encoding scheme, which converts each categorical entry in the dataset into columns of zeros and ones based on the number of categories. These encoded characters are fed as inputs to a deep learning architecture with multiple convolution layers. This model shows that character-level convolutional networks achieve competitive results on large-scale datasets.

2.2.3 Multimodal Deep Learning Classifiers

Zulqarnain et al. [27] have proposed a classification model based on a combination of a Gated Recurrent Unit (GRU) and a Support Vector Machine (SVM). They have replaced the Softmax activation function in the output layer of the GRU with an SVM. This model has achieved remarkable results, particularly when storage is limited. It has also overcome the issues of vanishing and exploding gradients.

Haralabopoulos et al. [28] have proposed an automated sentiment classification model used to categorize human-generated content. This model consists of several multi-label DNN classification architectures and two ensembles. The first architecture is a simple CNN with fully connected layers. The second architecture integrates a Gated Recurrent Unit (GRU) with a convolution layer. The third architecture implements TFIDF and a DNN with three fully connected layers. This model makes the best use of these architectures to improve classification results without hyper-parameter tuning or data over-fitting.

Kowsari et al. [29] have also proposed a classification model called Random Multimodal Deep Learning (RMDL) that combines standard DL architectures in order to develop robust and accurate architectures for classification tasks. Their model is based on three architectures: CNN, RNN and DNN. The output is generated using a majority vote on the outputs of these architectures. The results demonstrate the effectiveness of this model.
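The majority-vote step mentioned for RMDL can be illustrated with a minimal sketch. This is not the RMDL implementation; the function name `majority_vote` and the sample labels are hypothetical and serve only to show how per-model predictions for one document could be combined.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of models."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical labels predicted by a CNN, an RNN and a DNN for one document
votes = ["XML", "JSON", "XML"]
print(majority_vote(votes))
```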

Moreover, Ding et al. [30] have proposed a multi-layer RNN model called Densely Connected Bidirectional LSTM (DC-Bi-LSTM) for text classification. It uses LSTM to encode the input sequence. In each layer, the hidden states are represented as a reading memory. This model improves on the traditional Bi-LSTM, achieves high performance and improves information flow in large tasks. Besides, the researchers expect that performance may improve further if a dense Bi-LSTM module is implemented instead of the Bi-LSTM encoder.

Furthermore, Wang et al. [31] have proposed a classification model based on a combination of the Dynamic Semantic Representation model and the Deep Neural Network model (DSRMDNN). First, it generates a model that captures the context of words and selects semantic words dynamically, where each word's attribute is assigned a weight to be quantified. Second, it feeds these features as elements to a text classifier composed of a deep belief network and a back-propagation neural network. This model improves the speed and accuracy of text classification, taking into consideration the value of low-frequency words and new words.

In addition, Cireşan et al. [32] have proposed a multi-model neural network classifier composed of multi-column deep neural networks that combine DNN and Convolutional Neural Network (CNN) architectures. The CNN strengthens the DNN max-pooling layer by using feed-forward networks with convolutional layers that include local and global pooling layers, which improves the classification results.

3 The Proposed Model

The proposed supervised automatic document classification model categorizes semi-structured and unstructured textual documents using DL techniques. It is composed of three successive stages: textual data preprocessing, text vectorization and document classification. Fig. 1 shows the proposed framework.

Figure 1: The proposed document classification framework

3.1 Textual Data Preprocessing

Once the data is imported from the corpus, it is automatically preprocessed so that it is suitable as input to the classification model. Textual data preprocessing involves two basic steps: text tokenization and text normalization. Algorithm 1 illustrates the tasks to be completed during preprocessing.

Algorithm 1: Textual data preprocessing
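The two preprocessing steps named above, tokenization and normalization, can be sketched as follows. This is only an illustration under stated assumptions, not the paper's Algorithm 1: the stop-word list, the regular expression and the function names `tokenize`, `normalize` and `preprocess` are hypothetical choices.

```python
import re

# Hypothetical stop-word list for illustration; the paper does not specify one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in"}

def tokenize(text):
    """Split raw text into alphanumeric tokens, discarding markup symbols."""
    return re.findall(r"[A-Za-z0-9]+", text)

def normalize(tokens):
    """Lowercase tokens and drop stop words and single-character tokens."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOP_WORDS and len(t) > 1]

def preprocess(text):
    """Tokenize, then normalize, a raw document string."""
    return normalize(tokenize(text))

print(preprocess('{"title": "The Annual Report", "pages": 12}'))
```

Because the tokenizer keeps only alphanumeric runs, the same function can be applied to JSON, XML or plain extracted PDF text; whether structural markers such as tag names should be kept as features is a design choice the sketch does not settle.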

3.2 Text Vectorization

The TFIDF and CBOW models are used to convert the raw text data into a numeric form that DL techniques can process.

3.2.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF is a numerical statistic that measures the importance of a word to a textual document in a corpus (i.e., dataset) [15]. It also acts as a weighting factor in information retrieval and text mining. The higher the TF-IDF value, the more important the word is to the document.

The TF-IDF scheme assigns a weight to each term in a document depending on both its Term Frequency (TF) and its Inverse Document Frequency (IDF). It is obtained by multiplying the two values, as given in Eq. (1).

$$w_{i,j} = tf_{i,j} \times idf_i \qquad (1)$$

where $w_{i,j}$ is the TF-IDF value of word $i$ in document $j$. TF is the ratio of the number of times a word occurs in a document to the total number of words in the document, which is obtained by Eq. (2).

$$tf_{i,j} = \frac{f_{i,j}}{n_j} \qquad (2)$$

where $f_{i,j}$ is the frequency of word $i$ in document $j$ and $n_j$ is the total number of words in document $j$.

IDF acts as a measure of how much information the word provides. It is calculated via Eq. (3).

$$idf_i = \log \frac{|D|}{|\{d \in D : i \in d\}|} \qquad (3)$$

where $|D|$ is the total number of documents and $|\{d \in D : i \in d\}|$ is the number of documents containing the word $i$. If this count is zero, the denominator is replaced by $1 + |\{d \in D : i \in d\}|$ to avoid division by zero.
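As a worked illustration of the TF, IDF and TF-IDF definitions above, the following sketch computes the weight of one term in a toy corpus. The corpus, the function names and the use of the natural logarithm are assumptions made for the example; the paper does not state the logarithm base.

```python
import math

def tf(term, doc):
    """Term frequency: occurrences of the term divided by document length."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency with the add-one guard for unseen terms."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        containing = 1 + containing  # avoid division by zero
    return math.log(len(corpus) / containing)  # natural log is an assumption

def tfidf(term, doc, corpus):
    """TF-IDF weight: product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, corpus)

# Toy corpus of three already-tokenized documents
corpus = [
    ["xml", "schema", "xml", "tag"],
    ["json", "object", "key"],
    ["report", "tag", "page"],
]
print(round(tfidf("xml", corpus[0], corpus), 4))  # 0.5 * ln(3) -> 0.5493
```

Here "xml" occurs twice among four words (TF = 0.5) and appears in one of three documents (IDF = ln 3), so it receives a higher weight than "tag", which appears in two documents.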

3.2.2 Continuous Bag-of-Words (CBOW) Model

CBOW is a predictive DL model that maps words to vectors and learns the word embedding in order to capture contextual and semantic similarities [18]. Let $W = \{w_{i-n}, \ldots, w_{i-1}, w_i, w_{i+1}, \ldots, w_{i+n}\}$ be a window of words. CBOW tries to predict the target word given its surrounding context words. It can be modeled as $f: X \to Y$, where $Y = w_i$ represents the target word and $X = W \setminus \{w_i\}$ represents the surrounding context words.
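The mapping from context words to target word can be made concrete by generating the (context, target) training pairs that a CBOW model would learn from. This sketch covers only pair generation, not the embedding training itself; the function name `cbow_pairs` and the sample sentence are hypothetical.

```python
def cbow_pairs(tokens, n):
    """Return (context, target) pairs using up to n words on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - n):i]
        right = tokens[i + 1:i + 1 + n]
        context = left + right
        if context:  # skip tokens that have no neighbors at all
            pairs.append((context, target))
    return pairs

sentence = ["the", "document", "is", "classified", "automatically"]
for context, target in cbow_pairs(sentence, n=1):
    print(context, "->", target)
```

At the sequence boundaries the window is simply truncated, which is one common convention; padding with a special token would be an equally reasonable alternative.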

3.3 Textual Documents Categorization

This paper builds a document classification multimodal to categorize large corpora of textual documents. The multimodal is a stacked ensemble of several individual DL techniques: DNN, RCNN and Bi-LSTM. Fig. 2 shows the structure of the proposed classification multimodal.

3.3.1 Deep Neural Network (DNN)

DNN architectures are feed-forward multilayer architectures. The researchers' implementation of the DNN is a discriminatively trained model that uses ReLU as an activation function. The input is a sequence of word embedding features. The output layer has as many neurons as there are classes and uses the Softmax function.

In addition, the input (500×50) is generated by an embedding vectorization layer and passed through five consecutive hidden levels with 512 nodes each. Each hidden level is composed of a dropout layer and a dense layer. A dense layer performs a matrix-vector multiplication with trainable parameters and applies the ReLU activation function, as given in Eq. (4).

$$f(x) = \max(0, x) \qquad (4)$$

The dropout layer randomly sets units to zero with a given probability. Next, an output layer of size 3 performs the multi-class classification using softmax as the activation function, as stated in Eq. (5).

$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \quad j = 1, \ldots, K \qquad (5)$$

where $z$ is the vector of output-layer scores and $K$ is the number of classes.
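A forward pass through stacked dense layers with ReLU, followed by a softmax output over three classes, can be sketched in plain Python. This is an illustration under assumptions, not the authors' implementation: the layer sizes are toy values rather than five hidden layers of 512 nodes, the weights are random rather than trained, and dropout is omitted because it is only active during training.

```python
import math
import random

def relu(values):
    """Element-wise ReLU: max(0, x)."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Softmax over a score vector; the max is subtracted for stability."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """Matrix-vector product plus bias: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Create random weights and zero biases (untrained, for illustration)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(features, hidden_layers, output_layer):
    """Apply each hidden dense+ReLU layer, then the softmax output layer."""
    activations = features
    for weights, biases in hidden_layers:
        activations = relu(dense(activations, weights, biases))
    return softmax(dense(activations, *output_layer))

rng = random.Random(0)
sizes = [8, 16, 16]  # toy sizes; the paper uses five hidden layers of 512 nodes
hidden = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
output = init_layer(sizes[-1], 3, rng)  # three classes: PDF, JSON, XML
probs = forward([rng.random() for _ in range(8)], hidden, output)
print(probs)
```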

Figure 2: The proposed classification multimodal

3.3.2 Recurrent Convolutional Neural Network (RCNN)

This technique is a combination of RNN and CNN in order to capture the contextual information with the recurrent structure and to construct the representation of the text using the CNN technique.

The input (500×50) is generated by an embedding vectorization layer and passed to the hidden layer that combines the CNN and RNN techniques. The CNN consists of four consecutive convolution layers (4-Conv1D) with 256 filters and a kernel size of 2. The ReLU activation function is followed by four consecutive max-pooling levels (4-MaxPooling1D). The RNN consists of four consecutive LSTM levels (4-LSTM) with 256 nodes, whose output is passed to two dense layers using the ReLU activation function. After that, the output is generated using Eq. (5).
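The convolution and max-pooling operations used in the CNN branch can be illustrated on a single channel. This is a simplified sketch, not the 256-filter layers of the paper: the kernel values, the pooling size of 2 and the function names are assumptions made for the example.

```python
def conv1d(sequence, kernel, bias=0.0):
    """Valid 1D convolution followed by ReLU over a list of numbers."""
    k = len(kernel)
    return [max(0.0, sum(kernel[j] * sequence[i + j] for j in range(k)) + bias)
            for i in range(len(sequence) - k + 1)]

def max_pool1d(sequence, pool_size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(sequence[i:i + pool_size])
            for i in range(0, len(sequence) - pool_size + 1, pool_size)]

signal = [1.0, 3.0, -2.0, 4.0, 0.0, 5.0]
features = conv1d(signal, kernel=[1.0, -1.0])  # kernel size 2, as in the paper
pooled = max_pool1d(features)
print(features)  # [0.0, 5.0, 0.0, 4.0, 0.0]
print(pooled)    # [5.0, 4.0]
```

With a kernel size of 2 and no padding, a sequence of length 500 yields 499 convolution outputs, and each pooling stage of size 2 roughly halves the length again.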

3.3.3 Bidirectional-LSTM

Bidirectional LSTMs (Bi-LSTMs) are an extension of typical LSTMs that are intended to enhance the performance of the classification model. Bi-LSTMs train two LSTMs instead of one on the input sequence. The first processes the input sequence in the forward direction, while the other processes it in reverse order. The idea behind this technique is to make the forward state responsible for the positive time direction and the backward state responsible for the opposite direction.

The input (500×50) is generated by an embedding vectorization layer and passed to the bidirectional layer. The bidirectional layer uses 100 memory cells in parallel in both LSTMs to generate an output 30 data points wide and 256 data points high. Next, the time-distributed layer generates an output of the same shape, 30 data points wide and 256 data points high. The generated output is passed to the flatten layer, which produces an output of 7680 points, and that is finally fed as input to the dense layer to find the closest output class.
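The idea of reading a sequence in both directions can be sketched with a deliberately simplified recurrence. The assumption here is loud: a scalar tanh recurrence stands in for a full LSTM cell, and the weights `w_in` and `w_rec` are hypothetical fixed values, so the sketch shows only how forward and backward states are aligned per time step.

```python
import math

def recurrent_pass(sequence, w_in=0.5, w_rec=0.3):
    """Run a simple tanh recurrence and return the hidden state at each step."""
    state, states = 0.0, []
    for x in sequence:
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states

def bidirectional(sequence):
    """Pair each step's forward state with the backward state for that step."""
    forward = recurrent_pass(sequence)
    # Process the reversed sequence, then flip back so indices line up in time.
    backward = recurrent_pass(sequence[::-1])[::-1]
    return list(zip(forward, backward))

steps = bidirectional([1.0, 0.0, -1.0])
print(steps)

# Flattening a 30 x 256 output gives the 7680 points reported in the text.
flattened_size = 30 * 256
print(flattened_size)
```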

3.3.4 Stacked Ensemble Technique

This technique combines a set of previously trained models (DNN, RCNN and Bi-LSTM) and merges their outputs with the concatenation function to generate the final classification outcome [33].
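The concatenation step can be illustrated with a small sketch in which the class-probability vectors of three base models are joined into one feature vector and passed to a linear meta-layer with softmax. The base-model outputs and the meta-layer weights below are hypothetical; in an actual stacked ensemble the meta-learner's weights would be learned from held-out predictions rather than fixed by hand.

```python
import math

CLASSES = ["PDF", "JSON", "XML"]

def softmax(values):
    """Softmax over a score vector; the max is subtracted for stability."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def stack_features(*model_outputs):
    """Concatenate the class-probability vectors of the base models."""
    stacked = []
    for probs in model_outputs:
        stacked.extend(probs)
    return stacked

def meta_predict(stacked, weights, biases):
    """Linear meta-layer over the concatenated features, then softmax."""
    scores = [sum(w * x for w, x in zip(row, stacked)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)

# Hypothetical base-model outputs for one document over (PDF, JSON, XML)
dnn, rcnn, bilstm = [0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.2, 0.7, 0.1]
stacked = stack_features(dnn, rcnn, bilstm)

# Hypothetical meta weights: each class score averages the three base
# probabilities for that class (a trained meta-learner would learn these).
weights = [[1 / 3 if j % 3 == c else 0.0 for j in range(9)] for c in range(3)]
probs = meta_predict(stacked, weights, [0.0] * 3)
label = max(range(3), key=probs.__getitem__)
print(CLASSES[label])
```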

4 Experimental Results

4.1 Dataset Description

The training set consists of three textual classes, XML, JSON and PDF documents, collected by web-crawling different websites. A total of 50,000 documents are randomly picked and allocated to the JSON and XML classes, taken from the following websites: https://catalog.data.gov/dataset?res_format=JSON and https://www.sba.gov/sites/default/files/data.json. For XML and JSON requests, an internal logger is used that collects 100,000 such requests. Additionally, for the PDF class, the dataset consists of 11,228 newswires from Reuters labeled over 46 topics.

4.2 Evaluation Metrics

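Multiple performance and evaluation criteria are used to assess the improvement of the proposed model in comparison with other existing models. Precision [34] acts as the Positive Predictive Value (PPV), as stated in Eq. (6).

$$Precision = \frac{TP}{TP + FP} \qquad (6)$$

Recall [34] acts as the True Positive Rate (TPR), as given in Eq. (7).

$$Recall = \frac{TP}{TP + FN} \qquad (7)$$

F-measure [34] is calculated as the harmonic mean of precision and recall, as illustrated in Eq. (8).

$$F\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall} \qquad (8)$$

where $TP$, $FP$ and $FN$ denote the numbers of true positives, false positives and false negatives, respectively.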
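The three metrics can be computed per class from the true and predicted labels, as in the following sketch. The label lists are made-up toy data, not results from the paper, and the function names are illustrative.

```python
def per_class_counts(y_true, y_pred, label):
    """Count true positives, false positives and false negatives for a label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    return tp, fp, fn

def precision(tp, fp):
    """Positive predictive value; 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """True positive rate; 0.0 when the class has no true instances."""
    return tp / (tp + fn) if tp + fn else 0.0

def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

# Toy labels for illustration only
y_true = ["PDF", "PDF", "JSON", "XML", "JSON", "PDF"]
y_pred = ["PDF", "JSON", "JSON", "XML", "JSON", "PDF"]

tp, fp, fn = per_class_counts(y_true, y_pred, "PDF")
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, f_measure(p, r))
```

For the PDF class in this toy data, two of three PDF documents are recovered with no false positives, so precision is 1.0, recall is 2/3 and the F-measure is 0.8.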

4.3 Experiments

In this section, a series of experiments is carried out to evaluate the performance of the researchers' revised individual classifiers and of the proposed combined document classification multimodal.

4.3.1 Experimental Results of DNN Model

Tabs. 1–3 report the precision, recall and F-measure of the individual DNN model for predicting PDF, JSON and XML documents, respectively. These results are based on the researchers' suggested hyperparameters: the number of epochs, the learning rate, the batch size and the number of hidden layers. Tab. 1 gives the classification results for predicting PDF documents using the TFIDF and CBOW text vectorization techniques. Tab. 2 gives the corresponding results for JSON documents, and Tab. 3 gives those for XML documents.

Table 1: Classification results of DNN for predicting PDF documents

4.3.2 Experimental Results of the RCNN Model

Tabs. 4–6 report the precision, recall and F-measure of the individual RCNN model for predicting PDF, JSON and XML documents, respectively. These results are based on the researchers' suggested hyperparameters: the number of epochs, the learning rate, the batch size and the number of hidden layers. Tab. 4 gives the classification results for predicting PDF documents using the TFIDF and CBOW text vectorization techniques. Tab. 5 gives the corresponding results for JSON documents, and Tab. 6 gives those for XML documents.

Table 2: Classification results of DNN for predicting JSON documents

Table 3: Classification results of DNN for predicting XML documents

Table 4: Classification results of RCNN for predicting PDF documents

Table 5: Classification results of RCNN for predicting JSON documents

Table 6: Classification results of RCNN for predicting XML documents

Table 7: Classification results of Bi-LSTM for predicting PDF documents

4.3.3 Experimental Results of Bi-LSTM Model

Tabs. 7–9 report the precision, recall and F-measure of the individual Bi-LSTM model for predicting PDF, JSON and XML documents, respectively. These results are based on the researchers' suggested hyperparameters, which include different numbers of epochs, element vectors, batch sizes and numbers of hidden layers. Tab. 7 gives the classification results for predicting PDF documents using the TFIDF and CBOW text vectorization techniques. Tab. 8 gives the corresponding results for JSON documents, and Tab. 9 gives those for XML documents.

Table 8: Classification results of Bi-LSTM for predicting JSON documents

4.3.4 Experimental Results of the Proposed Document Classification Multimodal

Tab. 10 reports the precision, recall and F-measure of the document classification multimodal for the unstructured PDF class, the semi-structured JSON class and the semi-structured XML class using the TFIDF and CBOW text vectorization techniques. The results indicate that the proposed multimodal based on the stacked ensemble technique gives better results than those reached by any of the individual models.

The strong results of the study are attributed to the proposed technique, which combines the RNN and CNN techniques and makes use of the advantages of both. It captures contextual information with the recurrent structure and constructs the representation of the text using the CNN, while the bidirectional networks make the forward state responsible for the positive time direction and the backward state responsible for the opposite direction. Finally, the researchers have used the stacked ensemble technique to combine the set of trained models: the outputs of the previously trained models are merged with the concatenation function to generate the final classification outcome. Prior to that, the researchers performed feature extraction using Word2Vec and TF-IDF to capture the position of the words in the text (syntax) and the meaning of the words (semantics). According to the results above, Word2Vec yields the best outcomes.

Table 9: Classification results of Bi-LSTM for predicting XML documents

Table 10: Classification results of the multimodal based on the TFIDF and CBOW techniques

5 Conclusion

Classification is an important task in machine learning, given the growing number and size of datasets that need sophisticated classification. Therefore, the researchers have proposed an automatic document classification multimodal for categorizing multi-typed textual documents. The proposed multimodal combines three individual classifiers, DNN, RCNN and Bi-LSTM, based on the stacked ensemble technique. The purpose of adopting this multimodal is to make managing and sorting textual documents easier. This is especially useful for publishers, financial institutions, insurance companies or any industry that deals with large amounts of content. Moreover, the proposed automatic document classification model significantly reduces the time spent on manual data entry, the costs and the turnaround time for document processing. Additionally, it yields an accurate, efficient and more objective classification, since it applies semantic classification based on deep learning. Furthermore, the evaluation results show that the combination of the models and the parallel learning architecture consistently achieves higher accuracy than conventional approaches and individual deep learning models.

Finally, in future studies the researchers aim to strengthen the feature extraction and representation stage by using the GloVe technique. Moreover, the researchers intend to extend the feature level by embedding multivariate analysis and dimensionality reduction techniques to specify which subspace the data approximately lies in and to find uncorrelated features. In addition, the researchers plan to develop a test data generative model for an automated testing tool and to embed the proposed automatic classification model as an integral first step of the generative model, so that different kinds of documents are classified before the test data for each type is generated.

Funding Statement: The authors received no specific funding for this study.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.