A text categorization based on the effect of text summarization

Amiri Dorna (P53644)

Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/476388

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Mohd. Juzaiddin Ab Aziz, Prof. Dr.
dc.contributor.author	Amiri Dorna (P53644)
dc.date.accessioned	2023-10-06T09:17:35Z	-
dc.date.available	2023-10-06T09:17:35Z	-
dc.date.issued	2012-12-05
dc.identifier.other	ukmvital:84861
dc.identifier.uri	https://ptsldigital.ukm.my/jspui/handle/123456789/476388	-
dc.description	Text Categorization (TC) is the task of automatically assigning a set of documents to a set of predefined categories. The problem with Text Categorization is that when all the terms (features) within the documents are taken as the feature set, the result is a high-dimensional feature space, which makes computation difficult and time-consuming. This work focuses on applying text summarization (TS) as an effective feature selection technique in TC to handle this problem. The research aims at TC based on a graph-based summarization approach. Feature selection plays a major role in TC by selecting informative features. Although current feature selection methods evaluate features well, they cannot reduce the size of the feature set. In the preprocessing phase, WordNet, a lexical database for the English language, and a stemmer based on the Porter stemming algorithm were applied. A text summary is a shorter version of the original text that retains its standpoints and main information; hence it was used as a replacement for the original text. TS was performed with the TextRank model, a graph-based approach that ranks all the sentences in a document by their importance. The summary, constructed by selecting the top 10%, 20% or 30% of sentences, is then used directly to select features. The classifier was trained with the k-nearest neighbour (KNN) machine learning algorithm. KNN classifies unlabeled documents in a test set based on labeled documents in a training set and assigns each to its relevant category. This work uses "hard categorization", in which a document can be assigned to only one category. The corpus was collected from online news agencies. The results reveal that graph-based TS on the training set eases the TC process by reducing the feature set size. This reduction shortens classifier training time and lowers computational complexity. 60% of the collected documents were used as the training set and the remaining 40% as the test set. Summaries of 10%, 20% and 30% were tested with the proposed method, and the 20% summary showed the best performance.,Master / Sarjana
dc.language.iso	eng
dc.publisher	UKM, Bangi
dc.relation	Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat
dc.rights	UKM
dc.subject	Text Categorization
dc.subject	Text processing (Computer science)
dc.title	A text categorization based on the effect of text summarization
dc.type	theses
dc.format.pages	93
dc.identifier.callno	QA76.9.T48A475 2013 3 tesis
dc.identifier.barcode	002015
Appears in Collections:	Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat
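
The pipeline described in the abstract (rank sentences with a graph-based model, keep a fixed share of them, then classify with KNN) can be illustrated with a minimal Python sketch. This is not the thesis implementation, which is under restricted access. The function names, the naive sentence splitter, the word-overlap similarity, the damping factor of 0.85, the cosine measure and the toy corpus are all assumptions made for illustration, and the WordNet and Porter stemming steps are omitted.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens (assumption: no stemming or WordNet lookup here)."""
    return re.findall(r"[a-z]+", text.lower())


def split_sentences(text):
    """Naive sentence splitter on '.', '!' and '?' (an assumption, not the thesis method)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Word-overlap similarity between two sentences, normalised by their lengths."""
    wa, wb = set(tokenize(a)), set(tokenize(b))
    if len(wa) < 2 or len(wb) < 2:
        return 0.0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by iterating a weighted PageRank-style update on the sentence graph."""
    n = len(sentences)
    w = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                w[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(text, ratio=0.2):
    """Keep the top `ratio` share of sentences, returned in their original order."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    k = max(1, round(len(sentences) * ratio))
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


def vectorize(text):
    """Bag-of-words term counts used as the feature vector."""
    return Counter(tokenize(text))


def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def build_training_set(labelled_docs, ratio=0.2):
    """Replace each training document by its summary before extracting features."""
    return [(vectorize(" ".join(summarize(text, ratio))), label)
            for text, label in labelled_docs]


def knn_classify(doc, training, k=3):
    """Hard categorization: return the single majority label among the k nearest documents."""
    query = vectorize(doc)
    nearest = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy corpus (invented for this sketch; the thesis corpus came from online news agencies).
docs = [
    ("The team won the match. The striker scored two goals. "
     "Fans cheered the team after the match. The coach praised the striker.", "sport"),
    ("The bank raised interest rates. Investors sold shares after the bank decision. "
     "Markets fell as rates rose. Analysts expect the bank to raise rates again.", "business"),
]
training = build_training_set(docs, ratio=0.5)
```

Only the training documents are summarized here, mirroring the abstract's statement that summarization was applied to the training set; a call such as `knn_classify("Bank changed interest rates", training, k=1)` then classifies an unsummarized test document against the reduced feature vectors.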

Files in This Item:

File	Description	Size	Format
ukmvital_84861+SOURCE1+SOURCE1.0.PDF Restricted Access		1.87 MB	Adobe PDF