Arabic part of speech disambiguation : a supervised stochastic morpheme-based approach

Mohamed Yahya Ali Albared (P45081)

Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/513472

Title:	Arabic part of speech disambiguation : a supervised stochastic morpheme-based approach
Authors:	Mohamed Yahya Ali Albared (P45081)
Supervisor:	Nazlia Omar, Professor Dr.
Keywords:	Arabic; Speech; Stochastic morpheme-based approach; Speech processing systems
Issue Date:	16-Jul-2011
Description:	Part of Speech (POS) disambiguation is the task of computationally determining which POS of a word is activated by its use in a particular context. Arabic is a highly inflectional and morphologically rich language, which presents several challenges for POS tagging, such as ambiguity, data sparseness, a large number of unknown words, and large, fine-grained tag sets. Most POS tagging algorithms are either rule-based or stochastic. While rule-based methods require considerable manual effort, stochastic taggers require large annotated corpora for each genre, and creating such corpora is time-consuming and labor-intensive. Given the lack of such large corpora, this dissertation describes the investigations carried out to find the best strategy for developing efficient and robust Arabic POS and morphological tagging models that require only a small amount of tagged data. The baseline tagging models are based on Hidden Markov Models (HMMs); specifically, bigram and trigram HMMs are investigated. Several dynamic smoothing techniques are used with the HMM models to overcome the data sparseness problem. This research also presents new methodologies for handling unknown-word POS tagging in Arabic. First, this work designs, implements and empirically evaluates several language-independent lexical models based on word affix probabilities. Second, new statistical integrated tagging models are introduced to provide an adaptive and transportable tagging scheme. Third, to deal with the non-concatenative nature of Arabic words, a new statistical light-pattern based unknown-word handler is introduced. This work also studies the influence of the tokenization level on the performance of the tagging models, in terms of accuracy and time complexity, in order to determine the best tokenization choice for POS tagging.
Finally, to deal with a more fine-grained POS tag set, this work presents a morpheme-based Arabic morphological disambiguator, which consists of several morpheme-based N-attributes stochastic classifiers and a module that combines them. Two small Arabic corpora from different genres are used for evaluation: the FUS-HA corpus and the Quranic Arabic Corpus. The best POS tagging accuracy achieved by the new tagging models is 94.5% for unknown-word tagging and 96.5% overall. Experimental results also show that the new tagging models significantly improve overall tagging accuracy over the baseline models and perform better than existing Arabic systems on 15 test sets from 15 different genres. In addition, the results show that morpheme-based tagging models are more efficient and accurate than word-based models. Finally, the N-attributes stochastic classifier combination model provides morphological tagging with an overall accuracy of 91.5% and reduces run time compared with direct classification approaches. (Ph.D.)
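The abstract names a bigram HMM with smoothing as a baseline tagger. As a rough illustration of that general technique only, and not the thesis's implementation, the sketch below trains a bigram HMM on a tiny hypothetical English toy corpus and decodes with the Viterbi algorithm. Add-one (Laplace) smoothing is an assumption standing in for the unspecified "dynamic smoothing techniques", and all function names and data are invented for the example.

```python
import math
from collections import defaultdict


def train_bigram_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, sorted(emit), vocab


def log_prob(counts, context, event, n_outcomes):
    """Add-one smoothed log P(event | context); an assumed smoothing choice."""
    row = counts.get(context, {})
    total = sum(row.values())
    return math.log((row.get(event, 0) + 1) / (total + n_outcomes))


def viterbi(words, model):
    """Return the most probable tag sequence under the bigram HMM."""
    if not words:
        return []
    trans, emit, tags, vocab = model
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra outcome reserved for unknown words
    # best[i][tag] = (log score of best path ending in tag, previous tag)
    best = [{}]
    for tag in tags:
        score = (log_prob(trans, "<s>", tag, n_tags)
                 + log_prob(emit, tag, words[0], n_words))
        best[0][tag] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            emission = log_prob(emit, tag, words[i], n_words)
            best[i][tag] = max(
                (best[i - 1][prev][0]
                 + log_prob(trans, prev, tag, n_tags)
                 + emission, prev)
                for prev in tags
            )
    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


# Hypothetical toy corpus, not taken from the thesis's evaluation data.
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train_bigram_hmm(toy_corpus)
print(viterbi(["the", "cat", "barks"], model))
```

In this sketch an unseen word receives the same smoothed emission probability under every tag, so its tag is decided by the transition context alone. The affix-based and pattern-based unknown-word models described in the abstract are aimed at doing better than such a uniform fallback.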
Pages:	234
Call Number:	TK7882.S65.A434 2012 3
Publisher:	UKM, Bangi
URI:	https://ptsldigital.ukm.my/jspui/handle/123456789/513472
Appears in Collections:	Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat

Files in This Item:

File	Description	Size	Format
Arabic part of speech disambiguation : a supervised stochastic morpheme-based approach.pdf (Restricted Access)	Full text	3.93 MB	Adobe PDF
