A Comparative Study on the Impact of Feature Selection and Dataset Resampling on the Performance of the K-Nearest Neighbors (KNN) Classification Algorithm
DOI: https://doi.org/10.23887/janapati.v13i2.82174

Abstract
This study evaluates the impact of dataset balancing and feature selection on the performance of the K-Nearest Neighbors (KNN) classification algorithm. The primary objective is to determine the effect of different training-data balance ratios on classification performance. The study also analyzes the respective contributions of feature selection and data balancing to overall classifier performance. Three datasets (Titanic, Wine Quality, and Heart Disease), sourced from Kaggle, were used in this research. After preprocessing, the datasets were subjected to three resampling scenarios with balance ratios of 0.3, 0.6, and 0.9. Feature selection combined correlation-test values and information-gain values, each weighted at 50%; features whose weighted sum was positive were selected. The KNN classifier was then applied to the datasets both with and without feature selection. The results indicate that a perfectly balanced ratio (ratio = 1) is not essential for improving classification performance: a balance ratio of 0.6 yielded results comparable to those of a perfect balance. Furthermore, the findings show that feature selection has a greater impact on classification performance than data balancing; specifically, data with a balance ratio of 0.3 and feature selection outperformed data with a balance ratio of 0.6 without feature selection.
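The two preprocessing steps described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes Pearson correlation for the "correlation test," approximates information gain with scikit-learn's mutual-information estimator, and implements the balance ratio as simple random oversampling of the minority class until minority/majority equals the target ratio (the paper does not specify its resampling method). The function names (`combined_feature_scores`, `oversample_to_ratio`) are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def combined_feature_scores(X, y, w_corr=0.5, w_ig=0.5):
    """Score each feature as 0.5*correlation + 0.5*information gain."""
    # Pearson correlation of each feature with the target
    # (assumption: the paper's "correlation test" is taken as Pearson here)
    corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    # Information gain approximated by mutual information with the class label
    ig = mutual_info_classif(X, y, random_state=0)
    return w_corr * corr + w_ig * ig

def select_features(X, y):
    """Keep the features whose weighted combined score is positive."""
    scores = combined_feature_scores(X, y)
    return np.where(scores > 0)[0]

def oversample_to_ratio(X, y, ratio, rng=None):
    """Randomly duplicate minority-class rows (binary labels assumed)
    until the minority/majority count ratio reaches `ratio`."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_min, n_maj = counts.min(), counts.max()
    n_target = int(ratio * n_maj)
    if n_target <= n_min:            # already at or above the target ratio
        return X, y
    idx = np.where(y == minority)[0]
    extra = rng.choice(idx, size=n_target - n_min, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])
```

With `ratio=0.6`, a 100-vs-20 class split is oversampled to 100 vs 60; the resampled data (with or without `select_features` applied) can then be fed to any KNN classifier.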
Copyright (c) 2024 I Gede Aris Gunadi, Dewi Oktofa Rachmawati

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.