Framework for deviation detection in text

Siti Sakira binti Kamaruddin

Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/513464

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Abdul Razak Hamdan, Prof. Dr.	-
dc.contributor.author	Siti Sakira binti Kamaruddin	-
dc.contributor.other	P37923	-
dc.date.accessioned	2023-10-16T04:36:58Z	-
dc.date.available	2023-10-16T04:36:58Z	-
dc.date.issued	2011-07-07	-
dc.identifier.other	P37923	-
dc.identifier.uri	https://ptsldigital.ukm.my/jspui/handle/123456789/513464	-
dc.description	The detection of deviations in text is an important research area that has gained considerable attention from the data mining community. Text deviations are implicit knowledge that deviates distinctly from the general information contained in textual data. This thesis describes original research in the field of text deviation detection by presenting a novel framework for deviation detection in alphanumeric text using the Conceptual Graph Interchange Format (CGIF) representation. To achieve this main objective, a text mining method is proposed to extract relevant information before transforming it into CGIFs. Another aim is to develop a linear deviation detection method with a novel error-tolerance dissimilarity function to detect deviations from the CGIFs. The framework addresses three non-trivial problems: 1) the high dimensionality of text data, 2) the need for an effective representation scheme that captures the semantics of the text, and 3) the NP-complete problem of graph matching. The current state of the art in text mining adopts shallow methods such as text categorization and summarization to produce a running system quickly; however, the results are less accurate. Most published work on text representation models individual words by calculating word frequencies and applying vector representations to identify similarities. These methods fail to capture the semantics of whole sentences. Graphical representation schemes such as CGIF are more appropriate for modelling the semantics of sentences; however, they introduce the third problem, the NP-complete problem of graph matching. Most research has tackled this problem by focusing on either structural similarity or conceptual similarity alone. Other work considers both concepts and structure; however, the computation is at best polynomial and requires a complex clustering method. The approach proposed in this thesis reduces the dimensionality of text by performing information extraction.
To improve the accuracy of the extraction results, a rule-based method combined with a Natural Language Processing (NLP) method is proposed. This method repeatedly scans the given document to learn and identify patterns and successfully extracts the relevant information. Because the meaning of sentences is essential in detecting deviations, all extracted sentences were represented as CGIFs to capture their semantics. The resulting CGIFs were then processed to detect deviations. For this purpose, a deviation-based detection method that implements a new error-tolerance dissimilarity algorithm was proposed. The inclusion of a synonym-embedded standard CGIF in the proposed method reduces the complexity of graph matching from NP-complete or polynomial to linear. To demonstrate the feasibility of the proposed method, nine years of real-world financial statements were selected as the research domain. Experimental evaluation revealed that the proposed method significantly reduced the search space and accurately detected the deviating knowledge in the financial statements. When compared with two other concept similarity measures, the experimental results show that the proposed method outperforms both, with accuracy comparable to the expert's judgement. Overall, the method is far less computationally demanding than other similar methods in the area. (PhD thesis)	-
dc.language.iso	may	-
dc.publisher	UKM, Bangi	-
dc.relation	Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat	-
dc.subject	Framework	-
dc.subject	Deviation detection	-
dc.subject	Text	-
dc.subject	Data mining	-
dc.title	Framework for deviation detection in text	-
dc.type	Theses	-
dc.rights.holder	UKM	-
dc.format.pages	269	-
dc.identifier.callno	QA76.9.D343.S643 2011	-
Appears in Collections:	Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat

Files in This Item:

File	Description	Size	Format
Framework for deviation detection in text.pdf (Restricted Access)	Full text	1.71 MB	Adobe PDF
