Improving the quality of dataset for diabetes prediction using a machine learning approach

Bashar Hamad Aubaidan

Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/783262

Title:	Improving the quality of dataset for diabetes prediction using a machine learning approach
Authors:	Bashar Hamad Aubaidan
Supervisor:	Rabiah Abdul Kadir, Assoc. Prof. Dr.; Mohamad Taha Ijab, Dr.
Keywords:	Diabetes -- Diagnosis -- Data processing; Machine learning; Data mining; Universiti Kebangsaan Malaysia -- Dissertations; Dissertations, Academic -- Malaysia
Issue Date:	24-Feb-2026
Abstract:	Diabetes is a chronic metabolic condition marked by persistently elevated blood glucose levels and influenced by multiple physiological and lifestyle factors, making early prediction challenging. Reliable predictive modelling requires high-quality datasets; however, issues such as missing values, class imbalance, and redundant or high-dimensional attributes often reduce model accuracy and generalisability. To address these limitations, this study introduces an integrated data-quality enhancement framework combining Bidirectional Neighbour Graph (BNG) imputation for missing data, Clustering Selection Synthesis Filtering (CSSF) for class imbalance, and Rough Set Theory (RST) for dimensionality reduction and feature selection. The proposed BNG–CSSF–RST framework was applied to three independent datasets: the Pima Indians Diabetes Dataset from the UCI Repository, the Diabetes Clinical Dataset (Kaggle, 2024), and the Kaggle Diabetes Prediction Dataset (2023). After preprocessing with the proposed framework, both Artificial Neural Network (ANN) and Support Vector Machine (SVM) models were trained and evaluated. Across all datasets, the framework yielded notable improvements in predictive performance. Using the Pima dataset (768 records, 9 features), ANN achieved 93.51% accuracy, while SVM reached 90.26%. For the Diabetes Clinical Dataset (100,000 records, 17 features), ANN obtained 96.95% accuracy and SVM achieved 96.32%. On the Kaggle Diabetes Prediction Dataset (100,000 records, 9 features), ANN attained 91.49% accuracy, and SVM achieved 89.12%. Overall, the results indicate that systematically addressing missing data, class imbalance, and irrelevant or redundant features substantially improves classification performance. The enhanced accuracies observed across all three datasets exceed those typically reported in earlier studies, confirming the robustness of the proposed BNG–CSSF–RST framework.
Finally, this approach provided a methodology for diabetes prediction specifically designed to address missing data, class imbalance, and dimensionality reduction, thereby enhancing overall data quality and enabling more robust analysis of complex datasets.
Notes:	e-thesis
Pages:	225
Publisher:	UKM, Bangi
URI:	https://ptsldigital.ukm.my/jspui/handle/123456789/783262
Appears in Collections:	Institute of Visual Informatics/ Institut Informatik Visual (IVI)

Files in This Item:

There are no files associated with this item.
