Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/783262
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | Rabiah Abdul Kadir, Assoc. Prof. Dr. | en_US
dc.contributor.advisor | Mohamad Taha Ijab, Dr. | en_US
dc.contributor.author | Bashar Hamad Aubaidan (P103708) | en_US
dc.date.accessioned | 2026-05-07T07:23:32Z | -
dc.date.available | 2026-05-07T07:23:32Z | -
dc.date.issued | 2026-02-24 | -
dc.identifier.uri | https://ptsldigital.ukm.my/jspui/handle/123456789/783262 | -
dc.description.abstract | Diabetes is a chronic metabolic condition marked by persistently elevated blood glucose levels and influenced by multiple physiological and lifestyle factors, making early prediction challenging. Reliable predictive modelling requires high-quality datasets; however, issues such as missing values, class imbalance, and redundant or high-dimensional attributes often reduce model accuracy and generalisability. To address these limitations, this study introduces an integrated data-quality enhancement framework combining Bidirectional Neighbour Graph (BNG) imputation for missing data, Clustering Selection Synthesis Filtering (CSSF) for class imbalance, and Rough Set Theory (RST) for dimensionality reduction and feature selection. The proposed BNG–CSSF–RST framework was applied to three independent datasets: the Pima Indians Diabetes Dataset from the UCI Repository, the Diabetes Clinical Dataset (Kaggle, 2024), and the Kaggle Diabetes Prediction Dataset (2023). After preprocessing with the proposed framework, both Artificial Neural Network (ANN) and Support Vector Machine (SVM) models were trained and evaluated. Across all datasets, the framework yielded notable improvements in predictive performance. On the Pima dataset (768 records, 9 features), ANN achieved 93.51% accuracy, while SVM reached 90.26%. On the Diabetes Clinical Dataset (100,000 records, 17 features), ANN obtained 96.95% accuracy and SVM achieved 96.32%. On the Kaggle Diabetes Prediction Dataset (100,000 records, 9 features), ANN attained 91.49% accuracy, and SVM achieved 89.12%. Overall, the results indicate that systematically addressing missing data, class imbalance, and irrelevant or redundant features substantially improves classification performance. The enhanced accuracies observed across all three datasets exceed those typically reported in earlier studies, confirming the robustness of the proposed BNG–CSSF–RST framework. Finally, this approach provides a methodology for diabetes prediction that is specifically designed to address missing data, class imbalance, and dimensionality reduction, thereby enhancing overall data quality and enabling more robust analysis of complex datasets. | en_US
dc.language.iso | en | en_US
dc.publisher | UKM, Bangi | en_US
dc.relation | Institute of IR4.0 / Institut IR4.0 (IIR4.0) | en_US
dc.rights | UKM | en_US
dc.subject | Diabetes -- Diagnosis -- Data processing | en_US
dc.subject | Machine learning | en_US
dc.subject | Data mining | en_US
dc.subject | Universiti Kebangsaan Malaysia -- Dissertations | en_US
dc.subject | Dissertations, Academic -- Malaysia | en_US
dc.title | Improving the quality of dataset for diabetes prediction using a machine learning approach | en_US
dc.type | Theses | en_US
dc.description.notes | e-thesis | en_US
dc.format.pages | 225 | en_US
dc.format.degree | Ph.D | en_US
dc.description.categoryoftheses | Access Terbuka/Open Access | en_US
Appears in Collections: Institute of Visual Informatics / Institut Informatik Visual (IVI)

Files in This Item:
There are no files associated with this item.


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.