Arabic part of speech tagging using k-nearest neighbour and naive bayes classifiers combination

Rund Fareed Suleiman Mahafdah (P68502)

Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/462652

Title:	Arabic part of speech tagging using k-nearest neighbour and naive bayes classifiers combination
Authors:	Rund Fareed Suleiman Mahafdah (P68502)
Supervisor:	Prof. Dr. Nazlia Binti Omar
Keywords:	Speech processing systems; Computational linguistics; Arabic language -- Morphology
Issue Date:	6-Mar-2014
Description:	Part-of-speech (POS) tagging is an important preprocessing step in many natural language processing applications, such as text summarization, question answering and information retrieval. It is the process of assigning every word in a given context its appropriate part of speech. Various POS tagging techniques have been developed and evaluated in the literature. However, some POS tagging models are known to perform poorly on Quranic Arabic because of the complexity of the text, which presents several challenges for POS tagging: high ambiguity, data sparseness and a large number of unknown words. The main problem addressed here is to determine how existing, efficient methods perform on Arabic and how the Quranic corpus can be used to build an efficient framework for Arabic POS tagging. We propose an experimental classifier combination framework for Arabic POS tagging, selecting two of the best diverse probabilistic classifiers used in numerous works on non-Arabic languages: K-Nearest Neighbour (KNN) and Naive Bayes (NB). Majority voting is used as the combination strategy to exploit the advantages of both classifiers. In addition, an in-depth study was conducted on a large list of features to identify the effective ones and to investigate their role in enhancing the performance of POS taggers for Quranic Arabic. The aim is to integrate different feature sets and tagging algorithms efficiently to produce a more accurate POS tagging procedure. The data used in this work is the Arabic Quran Corpus, an annotated linguistic resource consisting of 77,430 words, with Arabic grammar, syntax and morphology annotated for each word in the Holy Quran. The highest accuracy achieved was 98.32%, which can be a significant improvement over the state of the art for Quranic Arabic text.
The most effective features, which yield this accuracy, are a combination of w0 (the current word), p0 (POS of the current word), p_1 (POS of three words before), p_2 (POS of two words before), and p_3 (POS of the word before). This thesis does not have a Master's / Doctor of Philosophy Thesis Declaration.
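As an illustration only, the context features and the voting step described in the abstract might be sketched as follows. This is not the thesis implementation: the function names, the padding symbol, the key names and the tie-breaking rule are all assumptions. The record's p_1, p_2 and p_3 appear here as the tags three, two and one word back, and the POS of the current word is treated as the label to predict.

```python
from collections import Counter

PAD = "<s>"  # hypothetical padding symbol for positions before the sentence start


def context_features(words, tags, i):
    """Build features for token i: the word itself plus the POS tags
    of the three preceding tokens (padded at the sentence start)."""
    def prev_tag(k):
        j = i - k
        return tags[j] if j >= 0 else PAD

    return {
        "w0": words[i],             # the current word
        "p_minus_3": prev_tag(3),   # POS of three words before
        "p_minus_2": prev_tag(2),   # POS of two words before
        "p_minus_1": prev_tag(1),   # POS of the word before
    }


def majority_vote(predictions):
    """Return the most frequent tag among the classifiers' predictions.

    Assumption: on a tie, the earliest classifier in the list whose tag
    is among the tied winners decides. With only two classifiers
    (for example KNN and NB) every disagreement is a tie, so the list
    order acts as a priority order.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    winners = {tag for tag, count in counts.items() if count == top}
    for tag in predictions:
        if tag in winners:
            return tag
```

For example, `majority_vote([knn_tag, nb_tag])` returns the shared tag when the two classifiers agree and falls back to `knn_tag` when they disagree; how the thesis resolves such disagreements is not stated in this record.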
Pages:	98
Call Number:	TK7882.S65M334 2014 3
Publisher:	UKM, Bangi
URI:	https://ptsldigital.ukm.my/jspui/handle/123456789/462652
Appears in Collections:	Faculty of Science and Technology / Fakulti Sains dan Teknologi

Files in This Item:

File	Description	Size	Format
ukmvital_118709+SOURCE1+SOURCE1.0.PDF (Restricted Access)		792.02 kB	Adobe PDF
