Please use this identifier to cite or link to this item:
https://ptsldigital.ukm.my/jspui/handle/123456789/513263
Title: | Multi-label document classification using class association rules with feature selection based on Pearson Correlation Coefficient |
Authors: | Roiss Mohammed Salem Alhutaish (P63455) |
Supervisor: | Nazlia Omar, Assoc. Prof. Dr. |
Keywords: | Multi-label document; Classification; Feature selection; Pearson Correlation Coefficient; Universiti Kebangsaan Malaysia -- Dissertations |
Issue Date: | 23-Feb-2017 |
Description: | With the exponential growth of online information and the steadily increasing number of documents in digital form, modern applications increasingly need to classify multi-label documents. In a multi-label classification problem, each document is associated with a subset of labels. Because documents typically contain many features and are usually associated with several labels, the correlations between features and labels are complex. Feature selection is an important task in machine learning that attempts to remove irrelevant and redundant features that can hinder performance. The main objectives of this thesis are to improve the performance of multi-label text classifiers by introducing new techniques to reduce the feature space and, building on them, to propose a new algorithm adaptation for multi-label classification. A new filter method, feature selection based on Pearson Correlation Coefficient (FSPCC), is used to reduce the feature space. To further reduce the feature subset, a wrapper approach is used for the first time, in which the best threshold is determined from the minimum confidence of Class Association Rules (CARs). These methods are evaluated through two approaches, a filter-wrapper approach and a wrapper-filter approach. Each approach has two stages: the filter stage is represented by FSPCC and five traditional methods, while CARs represent the wrapper stage. In addition, this research suggests transforming each multi-label document into single-label documents before applying the feature selection algorithm. Under this process, the document is copied to every label to which it belongs, and all of its features are assigned to each of those labels. Two algorithm adaptation approaches are used. The first is the traditional Naive Bayes (NB) classifier, adapted to deal directly with multi-label documents. 
This approach uses a threshold to predict the labels of a test document. The second approach is an adaptation of Class Association Rules (CARs) and Naive Bayes (NB), called ML-CARNB. It is adapted in two steps for multi-label classification. The first step uses CARs to determine the number of relevant and irrelevant features for each label. In the second step, the traditional Naive Bayes classifier is adapted to deal with multi-label documents; it employs the principle of relevant or irrelevant labels to predict the labels of an unseen document. Experiments conducted on benchmark datasets showed that the Naive Bayes Multi-label (NBML) classifier achieved a maximum average precision of 86.6% with FSPCC, and lower precision with the five traditional methods. The average precision of ML-CARNB, on the other hand, was better overall than that of the NBML classifier when coupled with several feature selection methods. In particular, ML-CARNB with FSPCC scored an average precision of 93.4%. These results show that the proposed ML-CARNB classifier performs better than the NBML classifier in most cases, especially with the new FSPCC feature selection method. Note: "Certification of Master's/Doctoral Thesis" is not available. |
Pages: | 176 |
Publisher: | UKM, Bangi |
Appears in Collections: | Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat |
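The abstract's copy transformation and Pearson-based filter step might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's FSPCC algorithm: the binary term-presence encoding, the per-label scoring, the choice to keep only positively correlated terms, and the names `copy_transform`, `pearson`, and `select_features` are all hypothetical.

```python
from math import sqrt


def copy_transform(docs):
    """Copy each multi-label document once per label it belongs to,
    keeping all of its terms in every copy."""
    return [(terms, label) for terms, labels in docs for label in labels]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(vx * vy)


def select_features(docs, threshold):
    """For each label, keep terms whose correlation with that label
    is at least `threshold` (assumption: positive correlation only)."""
    singles = copy_transform(docs)
    vocab = sorted({t for terms, _ in singles for t in terms})
    labels = sorted({label for _, label in singles})
    selected = {}
    for label in labels:
        ys = [1 if l == label else 0 for _, l in singles]
        selected[label] = [
            term for term in vocab
            if pearson([1 if term in terms else 0 for terms, _ in singles], ys)
            >= threshold
        ]
    return selected


# Toy corpus (invented for illustration): (set of terms, list of labels)
docs = [
    ({"goal", "match"}, ["sport"]),
    ({"goal", "vote"}, ["sport", "politics"]),
    ({"vote", "law"}, ["politics"]),
]
features = select_features(docs, threshold=0.5)
```

On this toy corpus the second document is copied under both of its labels, so the correlations are computed over four single-label instances rather than three documents.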
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
ukmvital_96606+SOURCE1+SOURCE1.0.PDF (Restricted Access) | | 435.77 kB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.