Please use this identifier to cite or link to this item:
https://ptsldigital.ukm.my/jspui/handle/123456789/513239
Title: | Multi-label text categorization based on boosting algorithms |
Authors: | Bassam Mohammed Ahmed Al-Salemi (P61102) |
Supervisor: | Mohd. Juzaiddin Ab Aziz, Assoc. Prof. Dr. |
Keywords: | Multi-label text; Boosting algorithm; Multi-label classifiers; Dissertations, Academic -- Malaysia |
Issue Date: | 31-May-2016 |
Description: | Multi-label text categorization is the task of automatically assigning documents to a set of predefined labels based on their contents. The multi-label boosting algorithm AdaBoost.MH, which is adapted from the well-known boosting algorithm AdaBoost, is one of the most accurate multi-label classifiers. Since its first appearance, AdaBoost.MH has received significant attention and has been considered the state-of-the-art classifier for multi-label classification tasks. As a boosting algorithm, AdaBoost.MH combines the outputs of multiple simple classifiers, called weak hypotheses, into a powerful composite classifier called the final hypothesis. In each boosting round, AdaBoost.MH produces a set of weak hypotheses equal in size to the number of extracted features, and only the one hypothesis that minimizes the Hamming loss is selected. This mechanism makes the learning time of AdaBoost.MH sensitive to the number of extracted features. Thus, this research aims to solve the computational complexity problem of AdaBoost.MH, as well as to enhance its classification performance for multi-label text categorization. AdaBoost.MH learning time is linear in the number of features. Therefore, the typical text representation model, Bag-of-Words (BOW), which generates a vast number of features, increases the computational cost of learning. This problem can be tackled directly by employing an effective feature selection method to reduce the BOW features, or indirectly by semantically clustering single words to represent the texts. Accordingly, a framework for AdaBoost.MH learning using topic modeling is proposed in this research. In the proposed framework, dubbed “LDA-AdaBoost.MH”, the Latent Dirichlet Allocation (LDA) topic model is used to map the texts to a small set of latent topics, where each topic is a semantic cluster of words across the texts.
The extracted latent topics are used as features representing the texts for AdaBoost.MH learning. In addition, a supervised version of LDA, called Labeled LDA (LLDA), was investigated as a feature selection method. The experimental results on four benchmarks demonstrated that LLDA is an effective feature selection method and led to the best performance compared with three state-of-the-art methods. Moreover, the experimental results showed that using a topic-based representation dramatically accelerated AdaBoost.MH learning and improved its classification performance. Furthermore, an improved version of AdaBoost.MH, named the “Rank-and-Filter Boosting Algorithm” (RFBoost), was proposed in this study. The weak learning of RFBoost is based on filtering a small subset of ranked features to build a new weak hypothesis in each boosting round, rather than using all features as in AdaBoost.MH. The experimental results showed that RFBoost is an efficient and effective multi-label algorithm in comparison with the other boosting algorithms. The best experimental results reported overall in this study were obtained by AdaBoost.MH when topic-based features were combined with the BOW features for representing the texts. |
Pages: | 183 |
Publisher: | UKM, Bangi |
Appears in Collections: | Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat |
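
The description above notes that each AdaBoost.MH round evaluates one weak hypothesis per feature and keeps the one with the lowest Hamming loss, while RFBoost considers only a small subset of ranked features. The following is a minimal sketch of that single selection step, not the thesis's implementation. It assumes binary term features, labels in {-1, +1}, and one-term decision stumps as weak hypotheses; the function names, the toy data, and the `ranked` ordering are all hypothetical.

```python
import numpy as np

def weighted_hamming_error(x, Y, D):
    """Weighted Hamming loss of the best one-term stump on binary feature x.

    For each block (term absent / term present) and each label, the stump
    predicts the sign with the larger total weight, so the weight it gets
    wrong is the smaller of the two.
    """
    err = 0.0
    for value in (0, 1):
        mask = x == value
        w_pos = (D[mask] * (Y[mask] == 1)).sum(axis=0)   # weight of +1 pairs, per label
        w_neg = (D[mask] * (Y[mask] == -1)).sum(axis=0)  # weight of -1 pairs, per label
        err += np.minimum(w_pos, w_neg).sum()
    return err

def select_weak_hypothesis(X, Y, D, candidates):
    """Return (feature index, error) of the best stump among the candidates.

    The cost of this step grows linearly with len(candidates).
    """
    errors = [weighted_hamming_error(X[:, j], Y, D) for j in candidates]
    best = int(np.argmin(errors))
    return candidates[best], errors[best]

# Toy data (hypothetical): 6 documents, 4 binary term features, 2 labels.
X = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 0, 1]])
Y = np.array([[ 1, -1],
              [ 1, -1],
              [-1,  1],
              [-1,  1],
              [ 1, -1],
              [-1,  1]])
D = np.full(Y.shape, 1.0 / Y.size)  # uniform weights over (document, label) pairs

# AdaBoost.MH-style round: every feature is a candidate.
all_features = list(range(X.shape[1]))
print(select_weak_hypothesis(X, Y, D, all_features))

# Rank-and-filter-style round: only the top-k of an assumed ranking is tried.
ranked = [2, 0, 3, 1]  # hypothetical output of a feature-ranking method
print(select_weak_hypothesis(X, Y, D, ranked[:2]))
```

In this toy setup both calls select the same feature, but the second evaluates two candidate stumps instead of four. How the thesis ranks features and chooses the filtered subset in each round is not specified in this record, so the top-k slice here is only an illustrative assumption.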
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
ukmvital_83263+SOURCE1+SOURCE1.0.PDF | Restricted Access | 17.52 MB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.