Part of Speech (POS) tagging for Malay language using supervised machine learning models

Shamsan Khaled Salem Gaber

Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/476231

Title:	Part of Speech (POS) tagging for Malay language using supervised machine learning models
Authors:	Shamsan Khaled Salem Gaber
Supervisor:	Mohd Zakree Bin Ahmed Nazree, Prof. Dr.
Keywords:	Part of speech; Computational linguistics
Issue Date:	8-Jan-2015
Description:	Part of Speech (POS) tagging is the process of assigning the appropriate part of speech, or word category, to each word in a corpus. It is a core step in many Natural Language Processing (NLP) tasks. Malay is widely used in Malaysia, Brunei, Singapore and Indonesia, and it is considered one of the most challenging languages because of its morphological richness. However, only limited research is available on POS tagging for Malay. The lack of linguistic tools and limited computational resources discourage researchers from conducting further investigation of the language. This thesis presents POS tagging for Malay using supervised machine learning models. The corpus was obtained from Dewan Bahasa dan Pustaka (DBP), the highest government-authorized body for Bahasa Melayu, which has defined the most widely accepted tagset (set of categories). The corpus contains 115,000 tokens; the experiments in this study used a sub-corpus of 20,000 tokens drawn from it. Feature selection is the process of selecting the important features (attributes) that affect categorization. Different types of features were extracted from the corpus, such as word features, POS features and affix features. The study focused on affix feature selection for the bigram Hidden Markov Model (HMM), and on word features, POS features and affix features for Naive Bayes (NB) and K-Nearest Neighbor (K-NN), to investigate the strengths and weaknesses of the feature sets. All three models were applied to the sub-corpus. The results showed that the bigram HMM yields its highest accuracy (89.93%) with three circumfix features. NB yields its highest accuracy (90.08%) with four circumfix, two previous POS tag and five word features. K-NN yields its highest accuracy (85.95%) with the word itself, the previous word and the following word as features when K=1. The study concludes that, among the models used, NB achieved the highest accuracy. Master/Sarjana
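The feature-based tagging described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the tag names (KN, KK), the three toy sentences, the use of a fixed-length prefix and suffix as a stand-in for the affix features, and the add-one smoothing are all assumptions made for illustration. It extracts word, neighbouring-word, affix and previous-tag features, then trains a small Naive Bayes tagger on them.

```python
import math
from collections import Counter, defaultdict


def extract_features(words, tags, i, affix_len=3):
    """Build the feature dict for the token at position i.

    words: tokens of one sentence
    tags:  POS tags already known for positions before i
    affix_len is a hypothetical parameter, not taken from the thesis.
    """
    word = words[i].lower()
    return {
        "word": word,
        "prev_word": words[i - 1].lower() if i > 0 else "<s>",
        "next_word": words[i + 1].lower() if i < len(words) - 1 else "</s>",
        "prefix": word[:affix_len],
        "suffix": word[-affix_len:],
        "prev_tag": tags[i - 1] if i > 0 else "<s>",
    }


class NaiveBayesTagger:
    """Naive Bayes over categorical features, with add-one smoothing (assumed)."""

    def __init__(self):
        self.tag_counts = Counter()
        self.feat_counts = defaultdict(Counter)  # (tag, feature name) -> value counts
        self.values = defaultdict(set)           # feature name -> values seen in training

    def train(self, tagged_sentences):
        for sentence in tagged_sentences:
            words = [w for w, _ in sentence]
            tags = [t for _, t in sentence]
            for i, tag in enumerate(tags):
                self.tag_counts[tag] += 1
                for name, value in extract_features(words, tags, i).items():
                    self.feat_counts[(tag, name)][value] += 1
                    self.values[name].add(value)

    def _log_score(self, tag, feats):
        total = sum(self.tag_counts.values())
        score = math.log(self.tag_counts[tag] / total)
        for name, value in feats.items():
            count = self.feat_counts[(tag, name)][value]
            vocab = len(self.values[name]) + 1  # one extra slot for unseen values
            score += math.log((count + 1) / (self.tag_counts[tag] + vocab))
        return score

    def tag(self, words):
        tags = []
        for i in range(len(words)):
            # Previously predicted tags feed the prev_tag feature (greedy decoding).
            feats = extract_features(words, tags, i)
            tags.append(max(self.tag_counts, key=lambda t: self._log_score(t, feats)))
        return tags


# Toy training data with illustrative tags: KN = noun, KK = verb.
TRAIN = [
    [("Ali", "KN"), ("makan", "KK"), ("nasi", "KN")],
    [("Siti", "KN"), ("membaca", "KK"), ("buku", "KN")],
    [("Ali", "KN"), ("membeli", "KK"), ("buku", "KN")],
]

tagger = NaiveBayesTagger()
tagger.train(TRAIN)
print(tagger.tag(["Siti", "makan", "buku"]))  # ['KN', 'KK', 'KN']
```

Decoding here is greedy, left to right. A bigram HMM would instead score whole tag sequences, for example with the Viterbi algorithm, which this sketch does not attempt.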
Pages:	73
Call Number:	QA76.9.N38G334 2015 3 tesis
Publisher:	UKM, Bangi
URI:	https://ptsldigital.ukm.my/jspui/handle/123456789/476231
Appears in Collections:	Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat

Files in This Item:

File	Description	Size	Format
ukmvital_76528+SOURCE1+SOURCE1.0.PDF	Restricted Access	1.2 MB	Adobe PDF
