Arabic named entity recognition using a rule based approach

Mohammed Abdullah Saleh Aboaoga (P53604)

Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/476123

Title:	Arabic named entity recognition using a rule based approach
Authors:	Mohammed Abdullah Saleh Aboaoga (P53604)
Supervisor:	Mohd Juzaiddin Ab Aziz, Dr
Keywords:	Arabic named entity recognition; Rule-based approach; Natural language processing (Computer science)
Issue Date:	11-May-2012
Description:	Named Entity Recognition (NER) is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, and locations, expressions of time, quantities, monetary values, and percentages. NER is an important preprocessing task for a variety of natural language processing applications, especially those that require some degree of semantic interpretation, such as machine translation, information retrieval, question answering, word sense disambiguation, and text summarization. In this study, a rule-based approach is applied to recognize named entities in Arabic texts. Based on linguistic information, we investigate the characteristics of Arabic named entities in the sports, economics, and politics domains in order to build rules for identifying the entities. Although some rules for person name recognition already exist, they cannot handle verb-particle constructions in which a verb from the introductory person verb list (IPVL) is followed by a stop word. In addition, some trigger words can be used to identify named entities in any domain, whereas others can be used in one domain only. The main goal of this study is to address these specific problems of Arabic NER using a rule-based technique. First, heuristic rules are generated according to Arabic grammar in order to handle names. The approach consists of three main steps: preprocessing, annotation, and named entity analysis. The preprocessing step includes normalization and tokenization. The annotation step identifies the named entities in the text that are found in Arabic gazetteers, which consist of lists storing specific information such as people's names, organization names, location names, and.
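The preprocessing and annotation steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the normalization recipe (stripping diacritics and tatweel, unifying alef variants) is a common convention chosen here for illustration, and the gazetteer entries and the function names `normalize`, `tokenize`, `build_gazetteer`, and `annotate` are hypothetical.

```python
import re

# Assumption: a common Arabic normalization recipe, not the thesis's own rules.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")    # short-vowel marks and tatweel
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")  # alef with madda / hamza above / hamza below


def normalize(text: str) -> str:
    """Strip diacritics and tatweel, then map alef variants to bare alef."""
    text = DIACRITICS.sub("", text)
    return ALEF_VARIANTS.sub("\u0627", text)


def tokenize(text: str) -> list[str]:
    """Split into runs of word characters, dropping punctuation and spaces."""
    return re.findall(r"\w+", text)


def build_gazetteer(entries: dict[str, str]) -> dict[tuple[str, ...], str]:
    """Normalize and tokenize each gazetteer name so lookups match the text."""
    return {tuple(tokenize(normalize(name))): label for name, label in entries.items()}


def annotate(tokens: list[str], gazetteer: dict[tuple[str, ...], str]) -> list[tuple[int, int, str]]:
    """Greedy longest-match lookup; returns (start, end, label) token spans."""
    max_len = max((len(key) for key in gazetteer), default=0)
    spans = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = gazetteer.get(tuple(tokens[i:i + n]))
            if label:
                spans.append((i, i + n, label))
                i += n
                break
        else:
            i += 1  # no gazetteer entry starts at this token
    return spans


# Hypothetical gazetteer entries, for illustration only.
gazetteer = build_gazetteer({
    "محمد": "PERSON",
    "الرياض": "LOCATION",
    "الأمم المتحدة": "ORGANIZATION",
})
tokens = tokenize(normalize("زار محمد مقر الأمم المتحدة"))
print(annotate(tokens, gazetteer))
```

Normalizing the gazetteer entries with the same functions as the input text keeps the two sides comparable; tokens left unlabeled here are the ones a later rule-based step would have to handle.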
After that, the named entity analyzer, which uses predefined rules and morphological information, is applied to recognize the remaining named entities in the text. In this step, lists of trigger words that contain both introductory verbs and introductory words are used to select the suitable rule. Furthermore, morphological information from the morphological analyzer is used to detect the part of speech of each word. Finally, the results of the system are compared with manually annotated text in order to compute precision, recall, and F-measure. In the experiments, the system achieved its best F-measure values (92.66% for sports, 92.04% for economics, 90.43% for politics) for the person name entity type. The average F-measure values for location (89.73% for sports, 83.5% for economics, 84.65% for politics) also clearly outperform those for organization (81.53% for sports, 80.85% for economics, 77.42% for politics). The performance measures for named entities are affected by the following factors: the type of named entity, the size of the corpus, the linguistic preprocessing tools, and the size of the gazetteers. (Master/Sarjana)
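As a rough illustration of the analyzer and evaluation steps described above, the sketch below applies one trigger-word rule and scores its output against a hand-labeled set. Everything specific in it is an assumption: the verb and particle lists are hypothetical stand-ins for the thesis's trigger lists, the single rule is far simpler than the rules the abstract refers to, and the morphological (part-of-speech) check is omitted entirely. The metrics use the standard definitions, with F-measure taken as the harmonic mean of precision and recall.

```python
# Hypothetical, already-normalized trigger lists; the thesis's lists are not reproduced here.
INTRO_PERSON_VERBS = {"قال", "صرح", "اجتمع"}
PARTICLES = {"مع", "الى"}


def apply_person_rule(tokens: list[str]) -> set[tuple[int, int, str]]:
    """Tag the token after an introductory verb (and an optional particle) as PERSON."""
    found = set()
    for i, token in enumerate(tokens):
        if token not in INTRO_PERSON_VERBS:
            continue
        j = i + 1
        if j < len(tokens) and tokens[j] in PARTICLES:
            j += 1  # skip the particle of a verb-particle construction
        if j < len(tokens):
            found.add((j, j + 1, "PERSON"))
    return found


def precision_recall_f1(predicted: set, gold: set) -> tuple[float, float, float]:
    """Exact-match precision, recall, and F-measure over (start, end, label) spans."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


tokens = ["اجتمع", "مع", "خالد", "ثم", "قال", "الوزير"]
predicted = apply_person_rule(tokens)
gold = {(2, 3, "PERSON")}  # hand-labeled spans for this toy sentence
print(precision_recall_f1(predicted, gold))
```

In this toy sentence the rule also tags the word after the second verb, which is a title rather than a name, so precision drops while recall stays perfect. This kind of over-generation is one reason a rule-based system might combine trigger words with part-of-speech information, as the abstract describes.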
Pages:	82
Call Number:	QA76.9.N38.A247 2012 3
Publisher:	UKM, Bangi
URI:	https://ptsldigital.ukm.my/jspui/handle/123456789/476123
Appears in Collections:	Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat

Files in This Item:

File	Description	Size	Format
ukmvital_73463+Source01+Source010.PDF Restricted Access		2.46 MB	Adobe PDF
