Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/513244
Title: Semantic representation approach based on lexical knowledge sources for semantic relatedness measurement
Authors: Abdulgabbar Mohammed Saleh Saif (P60052)
Supervisor: Mohd Juzaiddin Ab Aziz, Assoc. Prof. Dr.
Keywords: Semantic representation
Text clustering
Semantic interpretation
Lexical knowledge
Dissertations, Academic -- Malaysia
Issue Date: 25-Jun-2015
Description: Explicit Semantic Analysis (ESA) is a knowledge-based model that has received wide attention in computational linguistics and information retrieval. Drawing on human-organized language resources, ESA builds the semantic representation of a word from the textual definitions of the concepts in a given knowledge source. However, the representation vectors produced by ESA are very high dimensional and may contain many redundant concepts. Furthermore, the representation vector of a word is populated by conflating all the textual definitions (contexts) in which the word occurs, which amounts to conflating all the different meanings of an ambiguous word. The main aim of this thesis is to propose a reduced-dimension representation method that constructs the semantic interpretation of words as vectors over latent topics derived from the original ESA representation vectors. To model the latent topics, Latent Dirichlet Allocation (LDA) is adapted to the ESA vectors so that topics are extracted as probability distributions over concepts, rather than over words as in the traditional model. The proposed method is applied to two knowledge sources widely used in computational semantic analysis: WordNet and Wikipedia. On the English sources, which have a high degree of completeness, the method is evaluated on two natural language processing tasks: measuring the semantic relatedness between words/texts, and text clustering. The experimental results indicate that the reduced-dimension representation method outperforms the baseline models on both tasks across several gold-standard evaluation data sets.
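The ESA idea described above can be illustrated with a minimal sketch: each word is represented as a weighted vector over concept definitions, and relatedness is the cosine of two such vectors. The concept names and glosses below are toy illustrations, not the thesis's actual knowledge source; real ESA uses the full text of Wikipedia articles or WordNet glosses with a proper TF-IDF scheme.

```python
import math
from collections import Counter

# Toy knowledge source: each concept has a short textual definition.
# These entries are illustrative stand-ins for Wikipedia/WordNet concepts.
concepts = {
    "Bank_(finance)": "institution that accepts deposits and makes loans money",
    "River_bank": "land alongside a river water shore",
    "Money": "medium of exchange currency deposits loans",
}

def esa_vector(word):
    """ESA-style vector: TF-IDF weight of the word in each concept definition."""
    tfs = {c: Counter(text.split())[word] for c, text in concepts.items()}
    df = sum(1 for tf in tfs.values() if tf > 0)
    if df == 0:
        return {}
    idf = math.log(len(concepts) / df)
    return {c: tf * idf for c, tf in tfs.items() if tf > 0}

def relatedness(w1, w2):
    """Semantic relatedness as cosine similarity of the two concept vectors."""
    v1, v2 = esa_vector(w1), esa_vector(w2)
    dot = sum(v1.get(c, 0.0) * v2.get(c, 0.0) for c in v1)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

print(relatedness("deposits", "loans"))  # shared concepts -> positive score
print(relatedness("deposits", "shore"))  # no shared concept -> 0.0
```

The thesis's contribution replaces the concept dimensions with LDA-derived latent topics (distributions over concepts), shrinking these vectors while keeping the same cosine-based relatedness computation.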
Moreover, on the text clustering task, the proposed method improves on the clustering algorithm based on the conventional bag-of-words representation, both in the evaluation measures and in computational cost. Since knowledge-based methods depend mainly on the quality and quantity of the exploited knowledge sources, semantically poor non-English lexical sources such as Arabic WordNet cannot provide enough semantic evidence to address the ambiguity and synonymy issues in measuring relatedness. To overcome the limitations of Arabic WordNet, a cross-lingual technique is proposed for mapping the synsets in Arabic WordNet to their corresponding concepts in Wikipedia. To evaluate this technique, an Arabic mapping data set containing 1,291 synset-article pairs is compiled. The proposed technique, which exploits cross-lingual features, achieves a higher accuracy (93.6%) than the state-of-the-art methods that rely only on monolingual features (77.0% to 82.7%). The experimental analysis shows that leveraging bilingual features improves the mapping task with respect to both synonymy and ambiguity.
Note: "Certification of Master's/Doctoral Thesis" is not available.
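The synset-to-article mapping described above can be sketched as a candidate-scoring problem: score each candidate Wikipedia article by how well it matches the synset's (translated) gloss, and pick the best. The gloss, candidate articles, and overlap scoring below are toy assumptions for illustration, not the thesis's actual feature set.

```python
# Illustrative sketch of cross-lingual synset-to-article mapping.

def overlap_score(synset_words, article_text):
    """Fraction of the synset's gloss words that appear in the article text."""
    words = set(synset_words)
    article = set(article_text.split())
    return len(words & article) / len(words) if words else 0.0

def map_synset(synset_words, candidates):
    """Pick the candidate article whose text best matches the gloss."""
    return max(candidates, key=lambda a: overlap_score(synset_words, candidates[a]))

# Hypothetical English translations of an Arabic synset's gloss for
# "bank" in the financial sense (the cross-lingual feature).
gloss = ["institution", "money", "deposits", "loans"]
candidates = {
    "Bank": "a financial institution that accepts deposits and issues loans of money",
    "Riverbank": "the land at the edge of a river or stream",
}
print(map_synset(gloss, candidates))  # -> Bank
```

The point of the bilingual features is visible even in this toy: translating the gloss into the target language makes the ambiguous surface form disambiguable by lexical overlap with the correct article.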
Pages: 161
Publisher: UKM, Bangi
Appears in Collections:Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat

Files in This Item:
File: ukmvital_83271+SOURCE1+SOURCE1.0.PDF (Restricted Access)
Size: 306.9 kB
Format: Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.