Automatic detection of cross-language plagiarism using linear logistic regression for Arabic-English document

Zaid Alaa Abbood

Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/476308

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Sabrina Tiun, Dr.
dc.contributor.author	Zaid Alaa Abbood
dc.contributor.other	P72236	-
dc.date.accessioned	2023-10-06T09:16:11Z	-
dc.date.available	2023-10-06T09:16:11Z	-
dc.date.issued	2015-08-17
dc.identifier.other	ukmvital:81672
dc.identifier.other	P72236	-
dc.identifier.uri	https://ptsldigital.ukm.my/jspui/handle/123456789/476308	-
dc.description	Cross-language Plagiarism Detection (CLPD) automatically identifies and extracts plagiarism among documents written in different languages. The main challenge of CLPD is the difference between the languages of the texts: the original source can be analyzed and translated, and plagiarism can then be detected automatically by comparing the suspected text with the original text. This thesis proposes an Arabic-English cross-language plagiarism detection method that automatically detects the semantic relatedness between the words of the two files under examination. The proposed method consists of six phases. The first phase is pre-processing, in which the texts are processed using common methods such as tokenization and stop-word removal. The second phase involves key phrase extraction and translation: each text document is characterized by a set of keywords that indicate what the text is about, and an online translation system (Google Translate) performs the translation. The third phase retrieves the candidate documents that match the key phrases of the suspected plagiarized text. The fourth phase measures the similarity between the key phrases of the original text and the suspected text, examining the effect of four similarity features: Cosine Word Similarity (CS), Longest Common Subsequence (LCS), substring N-gram, and the combination of the three. The fifth phase is classification using the Linear Logistic Regression (LLR) approach, which is responsible for detecting plagiarism as one of two binary outcomes: (1) plagiarized parts of the document, and (0) non-plagiarized parts of the document. The last phase is evaluation using Precision, Recall and F-measure on a dataset consisting of Wikipedia articles. The experimental implementation was done in C# and achieved 97% Precision, 85% Recall and 90% F-measure. The results show that the LLR algorithm with the three combined measures (CS, LCS and N-gram) can be used effectively to detect Arabic-English cross-language plagiarism.
dc.description	Master of Computer Science
dc.language.iso	eng
dc.publisher	UKM, Bangi
dc.relation	Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat
dc.subject	Cross-language plagiarism
dc.subject	Text languages
dc.subject	Arabic-English
dc.subject	Detection method
dc.subject	Dissertations, Academic -- Malaysia
dc.title	Automatic detection of cross-language plagiarism using linear logistic regression for Arabic-English document
dc.type	theses
dc.rights.holder	UKM	-
dc.format.pages	72
dc.identifier.barcode	002220
Appears in Collections:	Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat
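
The three similarity features named in the abstract (CS, LCS and N-gram) can be illustrated with a minimal Python sketch. This is not the thesis's C# implementation: the function names, the whitespace tokenization, the choice of word bigrams, the Jaccard form of the N-gram overlap and the length normalization of LCS are all assumptions made for illustration, and both inputs are assumed to have already been translated into the same language. A small helper also recomputes the F-measure from the reported Precision and Recall.

```python
import math
from collections import Counter


def cosine_similarity(a_tokens, b_tokens):
    """Cosine similarity between bag-of-words count vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lcs_length(a_tokens, b_tokens):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b_tokens) + 1)
    for x in a_tokens:
        curr = [0]
        for j, y in enumerate(b_tokens, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a_tokens, b_tokens):
    """LCS length divided by the longer sequence length (assumed normalization)."""
    longest = max(len(a_tokens), len(b_tokens))
    return lcs_length(a_tokens, b_tokens) / longest if longest else 0.0


def ngram_similarity(a_tokens, b_tokens, n=2):
    """Jaccard overlap of word n-gram sets (assumed form of the N-gram feature)."""
    def grams(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    ga, gb = grams(a_tokens), grams(b_tokens)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical example sentences, not taken from the thesis dataset.
suspect = "the cat sat on the mat".split()
source = "the cat lay on the mat".split()
features = [
    cosine_similarity(suspect, source),
    lcs_similarity(suspect, source),
    ngram_similarity(suspect, source),
]
print(features)
print(f_measure(0.97, 0.85))
```

In a pipeline like the one described, a feature vector of this kind would be passed to a logistic regression classifier that outputs 1 (plagiarized) or 0 (not plagiarized). With the reported 97% Precision and 85% Recall, the harmonic mean comes to roughly 0.906, which is in line with the reported F-measure of about 90%.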

Files in This Item:

File	Description	Size	Format
ukmvital_81672+SOURCE1+SOURCE1.0.PDF Restricted Access		271.01 kB	Adobe PDF
