Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/513438
Title: A multi-label text classification based on graph unified information propagation and dynamic multi-sample feature selection
Authors: Adil Yaseen Taha (P89062)
Supervisor: Sabrina Tiun, Dr.
Keywords: Universiti Kebangsaan Malaysia -- Dissertations
Dissertations, Academic -- Malaysia
Graph unified information
Binary-coded decimal system
Issue Date: 29-Jun-2021
Description: Multi-label text classification, in which each document can carry more than one label at the same time, has become increasingly important in recent years. It is a challenging task because the space of potential label sets grows exponentially with the number of candidate labels. Multi-label datasets also contain several complexities that degrade classifier performance. Current state-of-the-art multi-label classification models suffer from low performance due to missing labels, high class imbalance, and high dimensionality. With missing labels, classification performance drops because training instances carry only an incomplete or partial set of labels. The class-imbalance problem arises when one set of classes dominates another and causes a severely skewed class distribution. Under a skewed class distribution, most classification models focus on the majority samples while ignoring or misclassifying minority samples. Minority samples occur rarely, but they are important as well. Ignoring them leads to poor performance in traditional machine learning models, which work well only on a balanced class distribution. Moreover, the multi-label text learning process involves a significant amount of irrelevant, redundant, and disruptive information, and the number of features is usually large. The high dimensionality of multi-label text data results in challenges such as poor performance, over-fitting, and increased computational and classification complexity. Therefore, this research aims to design models that enhance the performance of multi-label text classification by tackling all of these problems.
First, to handle the problem of missing labels, this research proposes a new model named graph-based unified information propagation for the multi-label missing labels problem (UG-MLP). UG-MLP handles missing labels by constructing a mixed graph that jointly incorporates instance-level feature-space similarity, label-distribution similarity, and accurate label correlations. Second, to solve the class-imbalance problem, this research proposes a new model named multi-label over-sampling and under-sampling and class alignment (ML-OUSCA), which balances the classes of the training examples by combining over-sampling, under-sampling, and non-sampling. Third, solving the high-dimensionality problem requires multi-label feature selection techniques that remove irrelevant features and transform high-dimensional documents into low-dimensional ones. Thus, this study proposes a new dynamic multi-label two-layer mutual information and clustering-based ensemble feature selection method (DMMC-EFS) to select only the useful and most relevant features. All of the proposed models were evaluated with two of the most popular models in multi-label learning, AdaBoost and Chain Classifier, using the standard multi-label text classification datasets Reuters-21578, Bibtex, and Enron. The results indicate that all the proposed models outperform the baseline models (UG-MLP, ML-OUSCA, and DMMC-EFS achieve F-measures of 86.52%, 88.20%, and 91.79%, respectively). In conclusion, all the proposed models contribute to the improvement of multi-label text classification.
Degree: Ph.D
Pages: 228
Publisher: UKM, Bangi
Appears in Collections: Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat

Files in This Item:
File: ukmvital_130624+Source01+Source010.PDF
Access: Restricted Access
Size: 2.34 MB
Format: Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.