Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/476117
Title: Feature selection using positive pointwise mutual information
Authors: Azadeh Amiri (P53606)
Supervisor: Mohd. Juzaiddin Ab Aziz, Prof. Dr.
Keywords: Selection; Positive pointwise; Information; Text processing (Computer science)
Issue Date: 5-Dec-2012
Description: The rapid growth of information has increased the number of documents on the same topic, and reading all of them takes a significant amount of time, so effective techniques are needed to identify the main idea of a set of documents. A summary is a text that keeps the most important parts of the source text(s) and removes the unimportant parts in order to preserve the useful concepts. The problem with text summarization systems is that as the number of documents increases, the large number of features makes sentence selection difficult and complicated. This work therefore focuses on feature selection and aims to identify a method for selecting features effectively in multi-document summarization. The work involves pre-processing, feature selection, hierarchical clustering and the TextRank model. WordNet, an electronic lexical database for the English language that groups words into synsets and records the semantic relations between these sets, is used. The Porter algorithm is applied to stem English words. For feature selection, Term Frequency - Inverse Document Frequency (TF-IDF), the weighting method that the Vector Space Model (VSM) adopts for calculating feature weights, and Positive Pointwise Mutual Information (PPMI) were used. The hierarchical clustering algorithm was implemented based on similarity calculation. The TextRank model was used to rank the sentences in each cluster so that the most significant sentence could be extracted from it. The methods were tested on a corpus gathered from online news agencies. The results reveal that the PPMI method and the graph-based algorithm can be used together successfully. The created summaries cover at least 75% and at most 95% of the main ideas of the news using PPMI+TF-IDF, and at least 80% and at most 95% using only TF-IDF. Master/Sarjana
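The description names PPMI but does not state its formula. The sketch below uses the standard definition, PPMI(x, y) = max(0, log2(P(x, y) / (P(x) P(y)))), and is only an illustration, not the thesis's implementation. Estimating probabilities from sentence-level co-occurrence, the function name `ppmi_scores` and the toy data are all assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def ppmi_scores(sentences):
    """Return {(w1, w2): PPMI} for word pairs that co-occur in a sentence.

    Assumption: P(w) and P(w1, w2) are estimated as the fraction of
    sentences containing the word or the pair, respectively.
    """
    n = len(sentences)
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        vocab = sorted(set(tokens))  # count each word once per sentence
        word_counts.update(vocab)
        pair_counts.update(combinations(vocab, 2))

    scores = {}
    for (w1, w2), joint in pair_counts.items():
        p_joint = joint / n
        p_w1 = word_counts[w1] / n
        p_w2 = word_counts[w2] / n
        pmi = math.log2(p_joint / (p_w1 * p_w2))
        # "Positive" PMI: negative associations are clipped to zero.
        scores[(w1, w2)] = max(0.0, pmi)
    return scores


# Hypothetical toy corpus of already tokenized and stemmed sentences.
toy_sentences = [
    ["market", "stock", "fall"],
    ["market", "stock", "rise"],
    ["rain", "fall"],
    ["rain", "storm"],
]
scores = ppmi_scores(toy_sentences)
```

Pair keys are ordered alphabetically, so the score for "market" and "stock" is looked up as `scores[("market", "stock")]`. Pairs that never co-occur are simply absent from the result, which is equivalent to a PPMI of zero.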
Pages: 100
Call Number: QA76.9.T48.A473 2012 3
Publisher: UKM, Bangi
URI: https://ptsldigital.ukm.my/jspui/handle/123456789/476117
Appears in Collections: Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat