Please use this identifier to cite or link to this item:
https://ptsldigital.ukm.my/jspui/handle/123456789/513464
Title: | Framework for deviation detection in text |
Authors: | Siti Sakira binti Kamaruddin (P37923) |
Supervisor: | Abdul Razak Hamdan, Prof. Dr. |
Keywords: | Framework; Deviation detection; Text; Data mining |
Issue Date: | 7-Jul-2011 |
Description: | The detection of deviations in text is an important research area and has gained considerable attention from the data mining community. Text deviations are implicit knowledge that deviates distinctively from the general information contained in textual data. This thesis describes original research in the field of text deviation detection by presenting a novel framework for deviation detection in alphanumeric text using the Conceptual Graph Interchange Format (CGIF) representation. To achieve the main objective, a text mining method is proposed to extract relevant information before transforming it into CGIFs. Another aim is to develop a linear deviation detection method with a novel error tolerance dissimilarity function to detect deviations from the CGIFs. The framework addresses three non-trivial problems: 1) the high dimensionality of text data, 2) the need for an effective representation scheme to capture the semantics of the text, and 3) the NP-complete problem of graph matching. The current state of the art in text mining adopts shallow methods such as text categorization and summarization to produce a running system quickly; however, the results are less accurate. Most published work in text representation models individual words by calculating word frequencies and applying vector representations to identify similarities. These methods fail to capture the semantics of whole sentences. Graphical representation schemes such as CGIF are more appropriate for modelling the semantics of sentences; however, they introduce the third problem, the NP-complete problem of graph matching. Most research has tackled this problem by focusing on either structural similarity or conceptual similarity alone. Other work considers both concepts and structure; however, the computation is at best polynomial and requires a complex clustering method. The approach proposed in this thesis reduces the dimensionality of text by performing information extraction. 
To improve the accuracy of the extraction results, a rule-based method combined with a Natural Language Processing (NLP) method is proposed. This method repeatedly scans the given document to learn and identify patterns and successfully extracts the relevant information. The meaning of sentences is essential in detecting deviations; therefore, all extracted sentences were represented as CGIFs to capture their semantics. The resulting CGIFs were then processed to detect deviations. For this purpose, a deviation-based detection method that implements a new error tolerance dissimilarity algorithm was proposed. The inclusion of a synonym-embedded standard CGIF in the proposed method reduces the complexity of graph matching from NP-complete or polynomial to linear. To demonstrate the feasibility of the proposed method, nine years of real-world financial statements were selected as the research domain. Experimental evaluation revealed that the proposed method significantly reduced the search space and accurately detected the deviating knowledge in the financial statements. When compared with two other concept similarity measures, the experimental results show that the proposed method outperforms them, with an improved accuracy comparable to the expert's judgement. Overall, the method is far less computationally demanding than other similar methods in the area. PhD |
Pages: | 269 |
Call Number: | QA76.9.D343.S643 2011 |
Publisher: | UKM, Bangi |
Appears in Collections: | Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
ukmvital_74316+Source01+Source010.PDF | Restricted Access | 1.71 MB | Adobe PDF | View/Open |