Hierarchical Stream Clustering Based NEWS Summarization System

2022-11-09 08:15ArunManickaRajaandSwamynathan

Computers Materials&Continua 2022年1期

M.Arun Manicka Raja and S.Swamynathan

1Department of Computer Science and Engineering,RMK College of Engineering and Technology,Chennai,602106,India

2Department of Information Science and Technology,College of Engineering Guindy,Anna University,Chennai,600025,India

Abstract: News feed is one of the potential information providing sources which give updates on various topics of different domains.These updates on various topics need to be collected since the domain specific interested users are in need of important updates in their domains with organized data from various sources.In this paper,the news summarization system is proposed for the news data streams from RSS feeds and Google news.Since news stream analysis requires live content,the news data are continuously collected for our experimentation.The major contributions of this work involve domain corpus based news collection,news content extraction,hierarchical clustering of the news and summarization of news.Many of the existing news summarization systems lack in providing dynamic content with domain wise representation.This is alleviated in our proposed system by tagging the news feed with domain corpuses and organizing the news streams with the hierarchical structure with topic wise representation.Further,the news streams are summarized for the users with a novel summarization algorithm.The proposed summarization system generates topic wise summaries effectively for the user and no system in the literature has handled the news summarization by collecting the data dynamically and organizing the content hierarchically.The proposed system is compared with existing systems and achieves better results in generating news summaries.The Online news content editors are highly benefitted by this system for instantly getting the news summaries of their domain interest.

Keywords: News feed;content similarity;parallel crawler;collaborative filtering;hierarchical clustering;news summarization

1 Introduction

Knowledge identification from online news articles have received keen attention among the news readers,especially from the Really Simple Syndication (RSS) feed-based news updates and Google news [1].The knowledge extracted from various news sources are mapped into many day-to-day applications.Various events are identified from news articles and the summaries are generated about a particular event with respect to different timelines [2].The news events are extracted by identifying the named entities present in the news content.The abstractive and extractive summaries are generated using summarization techniques such as abstractive and extractive summarizations [3].The semantic relevance is estimated using the wordnet and the hierarchical structure is represented for news articles [4].Single news article contains many keywords related to a particular topic.It is necessary to identify the domain of the keywords by tagging the keywords present in the news.Though the keywords are tagged in the news content,it is important to organize the content in a hierarchical structure for retrieving the similar news content for summarizing to the users.

In this work,a news clustering based summarization system is proposed to cluster various category of news content from multiple news sources and to generate news summaries on user interested topic.The proposed system is distinctive in handling the news updates for effectively organizing the news content to retrieve it later.Further,the extractive summary of the specific topic is generated from the clustered news contents.The proposed system has been evaluated for news crawling,news content retrieval and news summarization.The evaluation results shown that the proposed system performs better in summarizing the news contents to the end users.

The paper is organized as following sections.In Section 2,the related works of the clustering and news summarization mechanisms are discussed.In Section 3,the architecture of the news retrieval system is explained.In Section 4,the experimental results of the proposed system are discussed.In Section 5,the performance evaluation of the parallel crawler,hierarchical clustering and news summarization method are explained.In Section 6,the conclusion of the work is given.

2 Related Works

In recent years,there are lot of online recommendation systems available for assisting online shopping to various users depending upon their knowledge level.Here,we have discussed various methods related to the data collection,domain corpus,hierarchical clustering and summarization.RSS new feeds are the important sources of information from different online websites.The users are subscribing to only the required feed updates [5].In addition to the RSS feeds,Google news is providing news updates on various domains.In addition,the news contents are extracted for building corpuses which help domain oriented news analysis [6].Wordnet [7] is the prominently used Synset generator along with the tagger.The feed updates contain titles which has keywords that are used to identify the domains.They have used wordnet for tagging the keywords and identify the domain wise data.

Multi granularity hierarchical representation [8] is the content representation of the data for easy access of the fine grain level data.The authors have employed this method for the systematic organization of the content and its retrieval.Further,RSS news feeds are represented in Extensible Mark-up Language (XML) formats [9].This method is effective if the similar news items are merged together to gather the news from various sources.The relatedness between the RSS elements is also identified to merge the contents effectively.The RSS news articles are collected from various sources.In many cases,the news articles are redundant [10] in content wise.These redundant articles may be eliminated and the distinctive news articles may be clustered for later access.News content clustering and recommendation requires the categorization of the users in the web and their web browsing behavior needs to be analysed.The authors [11] have used user behavior data along with collaborative filtering for recommending the specific user interested content.Latent semantic analysis use mapping of high dimensional and sparse words into a semantic space with the correlation among the words [12].Text analysis model [13] uses deep learning techniques for effective product recommendation to the users.

In addition,it is essential to summarize the categorized news contents to the respective users.Extractive summarization [14] is one of the summarization techniques.It captures the sentences from the documents and generates the summary from the captured contents.Contextual information is used with the captured contents to generate effective summaries.The social media contents are summarized [15] topic wise and given to the users.Further,some of the semantic based clustering [16] is helpful in generating summaries from non-conclusive short texts.External knowledge resources are used for establishing the semantic relations among the text contents.Multi-document summarization [17] system is used for generating summaries from multiple document collections.The best summary is generated by estimating the information distance among the document collections.Many works have been carried out in the literature for incorporating the credible features of few existing mechanisms for developing a system with better performance.Few such works are used in building prediction models [18] and creating fake news detection system [19].In some works,the deep learning-based algorithms are used as risk analysis models and building mechanisms for defensing from denial-of-service attacks [20,21].The comparison of various methodologies related to the proposed system has been tabulated in Tab.1.

Table 1:Comparison of related works with merits and demerits of the methodologies

This research paper work is motivated and inspired by the related works discussed in this section.Our proposed system provides an improvement to the news summarization methods for news data streams and content retrieval is simplified with hierarchical news content clustering and user collaborative filtering.The quality of summary generation has significantly improved.

3 Collaborative Filtering Based NEWS Retrieval System

Hierarchical clustering is applied in many of the content retrieval system.Since hierarchical structure provides topic wise categorical representation elegantly,it is widely encouraged in most of the content structuring works.The retrieval time is considerably less in hierarchical structured content retrieval system [22].It performs well in processing the user given query and recommend better results from the hierarchically arranged contents for summary generation.Hence,we have proposed hierarchically clustering based news summarization system for generating effective news summaries in less time.The flow chart of the proposed system is illustrated in Fig 1.

Figure 1:Flow chart of the collaborative filtering based news retrieval system

The architecture of the proposed system is shown in Fig 2.The feed collector helps to collect the news feeds from various news Uniform Resource Locator (URL).Further,the collected feeds are checked for the domain specification in the title content available in the feed and the domain of the feed is identified.Various domain corpuses along with wordnet are used for checking the domain of the feeds.Hierarchical clustering is used for clustering the news articles category wise.It performs the categorization of the news contents and organizes the content topic wise.The user queries are obtained and the summaries are given as a result to the users.The user queries are natural language-based keywords.The summarizer generates the summary both topic wise and magazine wise.In addition,the user given keyword specific news contents are also retrieved from the repository.

Figure 2:Collaborative filtering based NEWS retrieval system

4 Experimental Results and Discussion

The significance of this research work focusses on collecting the news data dynamically and organizing the news data hierarchically.Further,the news contents are summarized effectively based on the user given query by processing with the collaborative filtering method.

4.1 Dataset

The dataset used in this work,is collected from the news sources using the news crawler program which we implemented in our system as part of news summarization system.The RSS feed news and news data streams are monitored and collected from google news [23].

4.2 News Data Collection

In this work,the news summaries are generated based on the user interest using the news updates received from numerous sources.To perform this,the first stage of work considered in this paper,is the data collection from various sources.The hierarchical structure is created for various domains.For example,Sports news are categorized with different types like cricket,football,basketball,etc.In addition,the region wise hierarchy is also represented to easily identify the location of the news such as country,state,district,city,etc.The consolidated summary of the news data collection is shown in Tab.2 wherein the news source and the topics and its news updates count are tabulated.The news content collection is observed for 1-day,1-week,1-month and 3-month period.The news articles collected during these periods is illustrated in Tab.3.

Table 2:Summary of various news sources,news topics and news updates count

Table 3:Weekly,monthly statistics of the news feed updates

4.3 Domain Corpuses

Around 97000 words are available in political domain corpus [24] and it has been applied in tagging the keywords from news updates.Healthcare domain consists of around 60000 words [25]and it has been incorporated to identify similar terms in the news contents.The business corpus contains around 600000 words [26] and it is used to know the business keywords present in it.Sports domain consists of keywords from various sports events.Around 32000 sports related words are available in sports corpus [27].Education domain contains the terms prominently used in education related activities.There are around 84000 words available in education domain [28].Electronics,nature,software and travel domain corpuses are taken from Wikipedia corpus collection [29].

4.4 Hierarchical Clustering of News Articles

Cosine similarity is determined to find the similar content existing in the news updates.Hierarchical clustering algorithm is used to detect the hierarchical structures among the news articles.The algorithm is shown as follows.

Algorithm 1: Hierarchical Clustering Algorithm Input Data: D={d1,d2,d3,...,dn} data set containing ‘n’news articles Result: hierarchy of clusters begin Initialize clusters Assign Ci=xi wherein xi represents data points and i=1 to n Create cluster for each news article loop for i ←1 to n do for j ← 1 to n do d(i,j) ←compute_similarity(xi,xj)calculate similarity measure identify appropriate cluster for each xi in C do αi=score(xi) //assigns similarity score value to alpha Clusterd=highest(αi) // similarity score makes cluster distance among clusters(Ci,Cj)←Cd(Cmin1,Cmin2)=minimum_dist (Ci,Cj) for all Ci,Cj in C remove Cmin1 and Cmin2 from C add {Cmin1,Cmin2} to C clusters=clusters + 1 assign data to closest clusters Ci or Cj remove from current clusters until all clusters generated return clusters end

The domain of the cluster is also identified with the clustering process.The cluster formation from various news articles is tabulated in Tab.4.The top 3 clusters formed out of the news articles received in a particular interval is mentioned in Tab.5

The clustered articles with its corresponding clusters and domain,is shown in Tab.6.It contains the cluster category,cluster topic,number of feeds and number of news articles.

4.5 Collaborative Filtering Based NEWS Content Retrieval

The collaborative filtering algorithm is used to filter the similar news content among the interested users.The similar news content is added to the recommendation set.The collaborative filtering based score is calculated for every similar news content and the news with maximum score is recommended to the user.The collaborative filtering algorithm is shown as follows.

Table 4:Cluster formation from news articles

Table 5:Top 3 cluster information

Table 6:Cluster categories,Topics with its feeds and article statistics

Algorithm 2: User based Collaborative Filtering Input: user set ‘S’,user query data ‘U’Output: Suggested results ‘R’R ←null foreach u in S do C(u) ←null,C represents collaborativeness neighbour(u) ←null,c ←n,wherein ‘n’represents‘news sequence’foreach v in S do if Sim (u,v) ＞=T,T represents filtering threshold neighbour(u) ←neighbour(u) U v if vi + 1 exists then C(u) ←C(u) U vi + 1 for nj in C(u)compute rj //computes the recommendation rj ←nj //mapping user to recommendation R ←R U rj return R //returns the recommended news to users

The results of the collaborative filtering algorithm are shown in Tab.7.

Table 7:Collaborative filtering accuracy using recommended articles

4.6 News Content Summarization

We have applied Extraction based summarization algorithm as a baseline method for performing document summarization using multiple document contents.Further,we have computed the probability distribution of the news for summary generation.The sentence with maximum score is taken for summary generation.The summarization steps are represented as follows.

Step 1.Calculate the probability distribution of the words in the news content.Let wi be the words and p(wi) be the probability of the words,where i represents the words.Step 2.Calculate the average probability of words in the news content.weight(Sj)=∑wi∈Sj p(wi)|{wi ∈Sj}|,Sj—sentences Step 3.Choose the top score sentence that has the high probability word.Step 4.Update probability for word chosen pnew(wi)=pold(wi).pold(wi)Step 5.Generate summary based the probability estimation of words in the sentences

The information about the user submitted query and the summary generation details from the news feeds is shown in Tab.8.

Table 8:Statistics about Query vs. Summary

Tab.9 shows the generated summary with the news feed count and article count for the user given query.The summaries generated for various user given queries are shown in Tab.10.

Table 9:Query based summary with news feed and article count

Table 10:Summary for the user given query

The summary generated for the actual google news is shown in Tab.11.Here,the summary is generated from 2 different news article contents.

5 Performance Evaluation

5.1 News Crawler

The news collection time for different number of URLs using various crawlers is tabulated in Tab.12.

The news collector is compared with different news crawler and is shown in Fig.3.The news collector results indicate that the news collector is performing faster than other news collecting crawlers for any number of feed URLs.This is achieved with the parallel crawler which performs the news collection by sharing the URLs to multiple thread program to run parallelly.

Table 11:Summarization results

Table 12:News crawling time for different set of feed URLs using various crawlers

5.2 News Retrieval Efficiency

The similar relevant keywords of the user given input are generated and the retrieval performance is evaluated.The news retrieval performance for direct user queries and relevance keywords is shown in Figs.4 and 5 respectively.

5.3 Query Evaluation

The user queries are evaluated on pre-processed keyword indexing,non-pre-processed keyword indexing and non-indexing news contents.The query processing time is tabulated in Tab.13.The comparison of query processing time for different indexing based retrieval is shown in Fig.6.

5.4 Evaluation of Summarization

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used for evaluating the summarization.It compares the summary against a set of references summary generated by human [33].This quantitative of overlapping words is measured using the precision.

Figure 3:Comparison of different news crawlers

Figure 4:News retrieval performance for direct user queries

Figure 5:News retrieval performance for relevance keywords of user queries

Table 13:Query processing time using various indexed news contents

Figure 6:Query processing time for various indexing based retrieval

The precision of the automatic summarization is shown in Tab.14.It means that the precision is 1.0 that is all the words in the reference summary is available in the automatic system summary.The precision calculated using the system summary is 0.88.

Table 14:Precision of automatic summarization

Further,we applied ROUGE specific metrics for effectively measuring the summary generation.The measures are ROUGE-N,ROUGE-S,ROUGE-L.These refers the size of the texts compared among the system summary and reference summary.ROUGE-1 refers the overlap of unigrams among the reference and system summaries.ROUGE-2 refers the overlap of bigrams among the reference and system summaries.ROUGE-1 and ROUGE-2 are the ROUGE-N type measures.It is referenced in the literature that ROUGE-1 and ROUOGE-L are appropriate for extractive summarization [34].

We have observed from the summarization evaluation that the ROUGE-N and ROUGE-L measures indicated that 88.88% and 77.77% of the actual news content is covered by the news summary generated.Since ROUGE-L needs to measure the longest sentence covered in the summary,the received value is a good measure that it has generated a summary covering the required sentences.The summarization performance of the proposed system is compared with other methodologies used in the literature for the summarization of document contents.The comparison result has been ensured with the ROUGE-1 metric which is the appropriate measure for news text summarization.The comparative results are tabulated in Tab.15.It is observed from the result that the proposed system is highly useful for effectively summarizing the dynamically collected news data.

Table 15:Comparison of summarization performance

5.5 Computational Complexity

The news data streams are received and the similarity needs to be estimated.The similarity computation involves the use of similarity matrix.It requires little large memory than other clustering algorithms since it needs to keep the data elements to store the matrix values.

Space complexity=O(n2)

Even hierarchical clustering takes more space,it is widely used in many of content organization systems.The hierarchical clustering algorithms satisfy reducibility property.The increased computational time required for generating the clusters help in providing the hierarchy of cluster set with exact and unique structure with this reducibility property.

5.6 Scope and Application

Mainly,in this work,the automatic news summarization system for the dynamic news articles with timeframes from google news.The scope of the proposed collaborative filtering based news retrieval system includes concise information from various news articles.It helps to eliminate the difficulty of going through huge news articles and provides 20% to 30%from the original news content.The scope is limited to generate the summary for the user interested keyword using the news articles in a time frame.This news retrieval system helps in a better way for the online news content editors who are in need of accessing the interested domain content immediately.

6 Conclusion

In this paper,the hierarchical clustering based news summarization system has been proposed to apply on RSS feed based news and google news.The news crawler used thread based news crawling to collect the news articles effectively with better collection efficiency which has been compared with various state of the art news crawlers.This work used various recent domain corpuses to tag and extract the topic wise news efficiently.The hierarchical clustering handled the news contents by estimating the similarity and produced the hierarchical clusters of the various domains appropriately.The evaluation of the automatic summary with the human generated summary models proved that it performed maximum for the hierarchically clustered news article contents.Hence,proposed news summarization system is suitable and useful for the content readers who are keen in knowing recent domain specific news with the generated summary from various news sources.

Funding Statement: The authors received no specific funding for this study.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.

Computers Materials&Continua2022年1期

Computers Materials&Continua的其它文章: Multiscale Image Dehazing and Restoration:An Application for Visual Surveillance; Measuring End-to-End Delay in Low Energy SDN IoT Platform; Deep Optimal VGG16 Based COVID-19 Diagnosis Model; Customer Prioritization for Medical Supply Chain During COVID-19 Pandemic; A New Fuzzy Adaptive Algorithm to Classify Imbalanced Data; A Hybrid Approach for Network Intrusion Detection