Please use this identifier to cite or link to this item:
https://ptsldigital.ukm.my/jspui/handle/123456789/513378
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Kamsuriah Ahmad, Assoc. Prof. Dr. | - |
dc.contributor.author | Saleh Rehiel A. Alenazi (P78194) | - |
dc.date.accessioned | 2023-10-16T04:36:02Z | - |
dc.date.available | 2023-10-16T04:36:02Z | - |
dc.date.issued | 2019-03-18 | - |
dc.identifier.other | ukmvital:121939 | - |
dc.identifier.uri | https://ptsldigital.ukm.my/jspui/handle/123456789/513378 | - |
dc.description | Record duplication deduction is the identification of similar records that appear more than once in a data warehouse. Redundant records arise from heterogeneity when data from several sources are integrated into one database. Various approaches have been explored to solve the problem of record duplication deduction in large databases. These approaches normally focus on improving two processes, namely the record preparation strategy and the similarity measurement among records. Existing record preparation strategies, namely window-based, block-based and machine learning procedures, are used to considerably reduce the processing required. Windowing or blocking entails an initial preparation of the records, such as partitioning or sorting, before comparison. Hence, an appropriate window or block size must be set for comparison, since inappropriate choices often lead to missed or unnecessary comparisons. Machine learning, on the other hand, requires re-training the model to understand different data standards. However, these approaches suffer from an increased number of comparisons between records and longer processing time. Overcoming these limitations is the motivation of this study. The adaptive tree approach, which is based on a binary tree and on indexing proximity, is proposed to overcome the limitations identified in the windowing, blocking and machine learning approaches. The second important process is measuring similarity among records, where Jaro-Winkler is a widely used technique owing to its accuracy, although its transposition component requires improvement. Therefore, the nontransposition Jaro-Winkler measurement is proposed to ensure a more accurate similarity measurement process.
Accordingly, the aim of this study is to propose a better record duplication deduction approach based on an adaptive binary tree (A-Tree) and nontransposition Jaro-Winkler (NT-Jaro Winkler) that improves accuracy and processing time. To achieve this aim, a design-based research (DBR) methodology is adapted, consisting of five main phases, namely Problem Identification, Theoretical Analysis, Proposed Solution, Tool Development, and Experimental Design and Evaluation. To evaluate the effectiveness of the two proposed processes, three experiments are conducted. Using three real datasets, the results show that A-Tree combined with NT-Jaro Winkler achieved better accuracy, lower time complexity and fewer comparisons than the existing window-based approaches, with F-scores of 0.980, 0.988 and 0.998 and comparison counts of 134, 1874 and 1231 on the Restaurant, Cora Citation Machine and Cora Information Extraction datasets, respectively. It is hoped that this study will help in obtaining high-quality data and serve as guidance for implementing better data storage systems. Ph.D. "Certification of Master's / Doctoral Thesis" is not available. | - |
dc.language.iso | eng | - |
dc.publisher | UKM, Bangi | - |
dc.relation | Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat | - |
dc.rights | UKM | - |
dc.subject | Universiti Kebangsaan Malaysia -- Dissertations | - |
dc.subject | Dissertations, Academic -- Malaysia | - |
dc.subject | Record duplication deduction | - |
dc.subject | Data warehouse | - |
dc.subject | Databases | - |
dc.title | An enhanced approach for record duplication deduction with a-tree and non-transposition Jaro-Winkler | - |
dc.type | Theses | - |
dc.format.pages | 239 | - |
dc.identifier.barcode | 005773(2021)(PL2) | - |
Appears in Collections: | Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat |
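
Note on the similarity measure named in the abstract: this record does not give the definition of the thesis's nontransposition variant (NT-Jaro Winkler), so the sketch below shows only the standard Jaro-Winkler similarity, including the transposition term that the thesis sets out to improve. The function names `jaro` and `jaro_winkler` and the defaults `prefix_scale=0.1` and `max_prefix=4` are illustrative assumptions based on the commonly cited form of the measure, not taken from the thesis.

```python
def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1] (not the thesis's NT variant)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they lie within this distance.
    match_window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0

    for i, ch in enumerate(s1):
        lo = max(0, i - match_window)
        hi = min(len2, i + match_window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break

    if matches == 0:
        return 0.0

    # Transposition term: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) / 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str,
                 prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix (assumed default weights)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


if __name__ == "__main__":
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
    print(round(jaro_winkler("DIXON", "DICKSONX"), 4))
```

In a duplicate-detection setting, two records would typically be flagged as candidate duplicates when the score for a chosen field exceeds a threshold; the threshold used in the thesis is not stated in this record.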
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
ukmvital_121939+SOURCE1+SOURCE1.0.PDF (Restricted Access) | | 764.9 kB | Adobe PDF | View/Open |