A text categorization based on the effect of text summarization

Amiri Dorna

Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/476388

Title:	A text categorization based on the effect of text summarization
Authors:	Amiri Dorna
Supervisor:	Mohd. Juzaiddin Ab Aziz, Prof. Dr.
Keywords:	Text Categorization; Text processing (Computer science)
Issue Date:	5-Dec-2012
Description:	Text Categorization (TC) is the task of automatically assigning documents to a set of predefined categories. The central problem in TC is that taking every term (feature) in the documents as the feature set produces a high-dimensional feature space, which makes computation difficult and time-consuming. This work applies text summarization (TS) as a feature selection technique in TC to address that problem, using a graph-based summarization approach. Feature selection plays a major role in TC by selecting informative features. Although current feature selection methods evaluate features well, they cannot reduce the size of the feature set. In the preprocessing phase, WordNet (a lexical database for the English language) and a stemmer based on the Porter stemming algorithm were applied. A text summary is a shorter version of the original text that retains its standpoints and main information, so it was used as a replacement for the full text. TS was performed with the TextRank model, a graph-based approach that ranks all the sentences in a document by their importance. The summary, constructed by selecting the top 10%, 20% or 30% of sentences, is then used directly to select features. The classifier was trained with the k-nearest neighbour (KNN) machine learning algorithm. KNN classifies unlabeled documents in a test set based on labeled documents in a training set, and assigns each to its relevant category. This work uses "hard categorization", in which a document can be assigned to only one category. The corpus was collected from online news agencies; 60% of the collected documents were used as the training set and the remaining 40% as the test set. The results show that graph-based TS on the training set eases the TC process by reducing the feature set size, which in turn reduces classifier training time and computational complexity. Summaries of 10%, 20% and 30% were tested with the proposed method, and the 20% summary showed the best performance. Degree level: Master / Sarjana.
Pages:	93
Call Number:	QA76.9.T48A475 2013 3 tesis
Publisher:	UKM, Bangi
URI:	https://ptsldigital.ukm.my/jspui/handle/123456789/476388
Appears in Collections:	Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat
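
The pipeline outlined in the Description (summarize each training document with a graph-based ranking, then classify with KNN) can be illustrated with a minimal sketch. This is not the thesis implementation, which is not available in this record. The function names, the stopword list, the naive sentence splitter, the word-overlap similarity, the term-frequency vectors and the cosine measure are all assumptions made for illustration, and the WordNet and Porter stemming steps are omitted.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the thesis).
STOPWORDS = {"the", "a", "an", "of", "on", "in", "as", "after", "and", "to"}


def tokenize(text):
    """Lowercase alphabetic tokens with stopwords removed (no stemming here)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def split_sentences(text):
    """Naive splitter on whitespace that follows '.', '!' or '?'."""
    return [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]


def sentence_similarity(a, b):
    """Word overlap normalized by log sentence lengths, in the style of TextRank."""
    wa, wb = set(tokenize(a)), set(tokenize(b))
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # avoids a zero or degenerate denominator
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank_scores(sentences, damping=0.85, iterations=50):
    """Score sentences by iterating a weighted PageRank-style update."""
    n = len(sentences)
    weights = [[sentence_similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(text, ratio=0.2):
    """Keep the top `ratio` share of sentences (at least one), in original order."""
    sentences = split_sentences(text)
    if not sentences:
        return ""
    keep = max(1, math.ceil(ratio * len(sentences)))
    scores = textrank_scores(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    return " ".join(sentences[i] for i in sorted(top))


def cosine(u, v):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def train_on_summaries(labeled_docs, ratio=0.2):
    """Build term-frequency vectors from summaries of (text, label) pairs."""
    return [(Counter(tokenize(summarize(text, ratio))), label)
            for text, label in labeled_docs]


def knn_classify(model, document, k=3):
    """Hard categorization: return the majority label among the k nearest vectors."""
    query = Counter(tokenize(document))
    nearest = sorted(model, key=lambda pair: cosine(query, pair[0]), reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

A typical call would be `model = train_on_summaries(train_docs, ratio=0.2)` followed by `knn_classify(model, new_text, k=3)`. Only the training documents are summarized here, on the reading that the Description applies summarization to the training set; test documents are vectorized in full.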

Files in This Item:

File	Description	Size	Format
ukmvital_84861+SOURCE1+SOURCE1.0.PDF (Restricted Access)		1.87 MB	Adobe PDF
