Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/476308
Title: Automatic detection of cross-language plagiarism using linear logistic regression for Arabic-English document
Authors: Zaid Alaa Abbood (P72236)
Supervisor: Sabrina Tiun, Dr.
Keywords: Cross-language plagiarism
Text languages
Arabic-English
Detection method
Dissertations, Academic -- Malaysia
Issue Date: 17-Aug-2015
Description: Cross-language Plagiarism Detection (CLPD) automatically identifies and extracts plagiarism among documents written in different languages. The main challenge is the difference between the text languages: the original source must be analyzed and translated before plagiarism can be detected automatically by comparing the suspected text with the original text. This thesis proposes an Arabic-English cross-language plagiarism detection method that automatically detects the semantic relatedness between the words of two suspect files. The proposed method consists of six phases. The first phase is pre-processing, where the texts are pre-processed using common methods such as tokenization and stop-word removal. The second phase involves key phrase extraction and translation: each text document is characterized by a set of keywords that give an idea of what the text is about, and an online translation system (Google Translate) performs the translation. The third phase retrieves the candidate documents that match the key phrases of the suspected plagiarized text. The fourth phase measures the similarity between the key phrases of the original text and the suspected text, examining the effect of four similarity features: Cosine Words Similarity (CS), Longest Common Subsequence (LCS), Substrings N-gram, and the combination of the three. The fifth phase is classification using a Linear Logistic Regression (LLR) approach, which is responsible for detecting plagiarism as a binary outcome: (1) plagiarized parts of a document and (0) non-plagiarized parts of a document. The last phase is evaluation using Precision, Recall and F-measure on a dataset consisting of Wikipedia articles. The experimental implementation was done in C# and achieved 97% Precision, 85% Recall and 90% F-measure.
The results show that the LLR algorithm with the three combined measures (CS, LCS and N-gram) can be used effectively to detect Arabic-English cross-language plagiarism.
Master of Computer Science
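The similarity features, the logistic classifier and the F-measure named in the description can be illustrated with a minimal Python sketch. This is not the thesis's C# implementation: the function names, the word-level tokenization, the normalisation of the LCS length, the use of word bigrams with Jaccard overlap for the N-gram feature, and any weights passed to the classifier are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a_tokens, b_tokens):
    """Cosine of the angle between the two bag-of-words count vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lcs_ratio(a_tokens, b_tokens):
    """LCS length divided by the longer sequence length (assumed normalisation)."""
    if not a_tokens or not b_tokens:
        return 0.0
    prev = [0] * (len(b_tokens) + 1)
    for x in a_tokens:
        curr = [0]
        for j, y in enumerate(b_tokens, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / max(len(a_tokens), len(b_tokens))


def ngram_overlap(a_tokens, b_tokens, n=2):
    """Jaccard overlap of word n-gram sets (assumed form of the N-gram feature)."""
    grams_a = {tuple(a_tokens[i:i + n]) for i in range(len(a_tokens) - n + 1)}
    grams_b = {tuple(b_tokens[i:i + n]) for i in range(len(b_tokens) - n + 1)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def plagiarism_probability(features, weights, bias):
    """Logistic regression score: sigmoid of a weighted sum of the features.

    The weights and bias are hypothetical inputs; the thesis's trained
    coefficients are not given in this record.
    """
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy example with an already-translated sentence pair (made-up data).
source = "the cat sat on the mat".split()
suspect = "the cat lay on the mat".split()
features = [
    cosine_similarity(source, suspect),  # 7/8
    lcs_ratio(source, suspect),          # 5/6
    ngram_overlap(source, suspect),      # 3/7
]
```

A pair would be labelled 1 (plagiarized) when `plagiarism_probability` exceeds a chosen threshold such as 0.5, and 0 otherwise. As a consistency check, `f_measure(0.97, 0.85)` evaluates to roughly 0.906, which is in line with the reported 90% F-measure.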
Pages: 72
Publisher: UKM, Bangi
Appears in Collections: Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat

Files in This Item:
File: ukmvital_81672+SOURCE1+SOURCE1.0.PDF (Restricted Access)
Size: 271.01 kB
Format: Adobe PDF