Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/513472
Title: Arabic part of speech disambiguation : a supervised stochastic morpheme-based approach
Authors: Mohamed Yahya Ali Albared (P45081)
Supervisor: Nazlia Omar, Professor Dr.
Keywords: Arabic
Speech
Stochastic morpheme-based approach
Speech processing systems
Issue Date: 16-Jul-2011
Description: Part-of-speech (POS) disambiguation is the task of computationally determining which POS of a word is activated by its use in a particular context. Arabic is a highly inflectional and morphologically rich language, which presents several challenges for POS tagging, such as ambiguity, data sparseness, a large number of unknown words, and large, fine-grained tag sets. Most POS tagging algorithms are either rule-based or stochastic. While rule-based methods require a large effort, stochastic taggers require large annotated corpora for each genre. The creation of such corpora is time-consuming and labor-intensive. Given the lack of such large corpora, this dissertation describes the investigations we carried out to find the best strategy for developing efficient and robust Arabic POS and morphological tagging models that require only a small amount of tagged corpora. The baseline tagging models are based on the Hidden Markov Model (HMM); specifically, bigram and trigram HMMs are investigated. Several dynamic smoothing techniques are used with the HMM models to overcome the data sparseness problem. This research also presents new methodologies for handling unknown-word POS tagging in Arabic. First, this work designs, implements and empirically evaluates several language-independent lexical models based on word affix probabilities. Second, new statistical integrated tagging models are introduced to provide an adaptive and transportable tagging scheme. Third, to deal with the non-concatenative nature of Arabic words, a new statistical light-pattern-based unknown word handler is introduced. This work also studies the influence of the tokenization level on the performance of the tagging models, in terms of accuracy and time complexity, in order to determine the best tokenization choice for POS tagging.
Finally, to deal with a more fine-grained POS tag set, this work presents a morpheme-based Arabic morphological disambiguator, which consists of several morpheme-based N-attributes stochastic classifiers and a module that combines them. Two small Arabic corpora from different genres are used for evaluation: the FUS-HA corpus and the Quranic Arabic Corpus. The best POS tagging result achieved by the new tagging models is 94.5% for unknown word tagging and 96.5% for overall tagging. Experimental results also show that the new tagging models significantly improve the overall tagging accuracy over the baseline models and perform better than existing Arabic systems on 15 test sets from 15 different genres. In addition, results show that morpheme-based tagging models are more efficient and accurate than word-based models. Our N-attributes stochastic classifier combination model provides morphological tagging with an overall accuracy of 91.5% and saves run time compared with the direct classification approaches.
Degree: Ph.D
Pages: 234
Call Number: TK7882.S65.A434 2012 3
Publisher: UKM, Bangi
Appears in Collections: Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat

Files in This Item:
File: ukmvital_74647+Source01+Source010.PDF
Access: Restricted Access
Size: 3.93 MB
Format: Adobe PDF

