Title
An extensive empirical study of feature terms selection for text summarization and categorization
Abstract
The ever-increasing availability of online textual databases and the growth of the Internet have prompted intensive research on automatic text summarization within the Natural Language Processing (NLP) community. Researchers and students working on a research project constantly face the problem that it is almost impossible to read most of the newly published papers. The goal of extraction-based text summarization is sentence selection. One way to select sentences is to score each sentence using a set of feature terms, rank the sentences by score, and then select the best ones for the summary. Broad indexing and speedy search alone are not enough for effective retrieval; categorized data are easier for users to browse when the data are well organized. In the first stage, each document is preprocessed: sentence segmentation, tokenization, stop word removal, case folding, lemmatization, and stemming. We then apply important features, sentence filtering features, and data compression features, and calculate a score for each sentence. We propose text summarization based on an HMM tagger to improve the quality of the summary. The documents are also categorized by creating impressions. We compare our results with the Copernicus summarizer, the Great summarizer, and the Microsoft Word 2007 summarizer, among others. The proposed system is tested with four similarity measures: Cosine, Jaccard, Jaro-Winkler, and Sorensen. The results show that the best summary quality was obtained by the feature terms method. Our text categorization approach is validated with Naïve Bayesian, Decision Tree Induction, KNN, and SVM approaches.
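The abstract names four similarity measures but does not give their definitions. As an illustration only, the following minimal sketch shows the standard token-level forms of three of them (cosine, Jaccard, and Sorensen-Dice). The whitespace tokenizer and all function names are assumptions made for this example and are not taken from the paper; Jaro-Winkler, which works on character sequences, is omitted.

```python
import math
from collections import Counter


def tokenize(text):
    """Case-fold and split on whitespace (a simplified stand-in for the
    paper's preprocessing; no stop word removal or stemming here)."""
    return text.lower().split()


def cosine_similarity(a, b):
    """Cosine of the angle between the term-frequency vectors of a and b."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard_similarity(a, b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def sorensen_dice_similarity(a, b):
    """Twice the shared vocabulary size divided by the sum of both vocabulary sizes."""
    sa, sb = set(a), set(b)
    total = len(sa) + len(sb)
    return 2 * len(sa & sb) / total if total else 0.0
```

In a setting like the one described, such functions might be applied to the tokens of a generated summary and a reference summary; how the paper actually applies them is not stated in the abstract.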
Year: 2012
DOI: 10.1145/2393216.2393317
Venue: CCSEIT
Keywords: sentence segmentation, automatic text summarization, ranking sentence, feature terms selection, copernicus summarizer, extensive empirical study, data compression feature, categorized data, text categorization approach, sentences selection, text summarization, online textual data base, natural language processing, term frequency
Field: Text graph, Tokenization (data security), Automatic summarization, Naive Bayes classifier, Computer science, Search engine indexing, Artificial intelligence, Natural language processing, Sentence, Word processing, Stop words
DocType: Conference
Citations: 2
PageRank: 0.36
References: 5
Authors (2)
Name               Order  Citations  PageRank
Suneetha Manne     1      2          1.04
S. Sameen Fatima   2      2          0.70