A Comparative Study on the Impact of Feature Selection and Dataset Resampling on the Performance of the K-Nearest Neighbors (KNN) Classification Algorithm
DOI: https://doi.org/10.23887/janapati.v13i2.82174

Abstract
This study evaluates the impact of dataset balancing and feature selection on the performance of the K-Nearest Neighbors (KNN) classification algorithm. The primary objective is to determine the effect of different training-data balance ratios on classification performance. The study also analyzes the respective contributions of feature selection and data balancing to overall classifier performance. Three datasets (Titanic, Wine Quality, and Heart Disease), sourced from Kaggle, were used in this research. After preprocessing, the datasets were subjected to three resampling scenarios with balance ratios of 0.3, 0.6, and 0.9. Feature selection combined correlation-test values and information-gain values, each weighted at 50%; features whose weighted sum was positive were selected. The KNN classifier was then applied to the datasets both with and without feature selection. The results indicate that a perfectly balanced ratio (ratio = 1) is not essential for improving classification performance: a balance ratio of 0.6 yielded results comparable to those of a perfect balance. Furthermore, the findings show that feature selection has a greater impact on classification performance than data balancing; specifically, data with a balance ratio of 0.3 and feature selection outperformed data with a balance ratio of 0.6 without feature selection.
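The two preprocessing steps described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes Pearson correlation for the "correlation test," approximates information gain with scikit-learn's mutual-information estimator, and implements the balance ratio as simple random oversampling of the minority class until minority/majority equals the target ratio (the paper does not specify its resampling method). The function names (`combined_feature_scores`, `oversample_to_ratio`) are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def combined_feature_scores(X, y, w_corr=0.5, w_ig=0.5):
    """Score each feature as 0.5*correlation + 0.5*information gain."""
    # Pearson correlation of each feature with the target
    # (assumption: the paper's "correlation test" is taken as Pearson here)
    corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    # Information gain approximated by mutual information with the class label
    ig = mutual_info_classif(X, y, random_state=0)
    return w_corr * corr + w_ig * ig

def select_features(X, y):
    """Keep the features whose weighted combined score is positive."""
    scores = combined_feature_scores(X, y)
    return np.where(scores > 0)[0]

def oversample_to_ratio(X, y, ratio, rng=None):
    """Randomly duplicate minority-class rows (binary labels assumed)
    until the minority/majority count ratio reaches `ratio`."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_min, n_maj = counts.min(), counts.max()
    n_target = int(ratio * n_maj)
    if n_target <= n_min:            # already at or above the target ratio
        return X, y
    idx = np.where(y == minority)[0]
    extra = rng.choice(idx, size=n_target - n_min, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])
```

With `ratio=0.6`, a 100-vs-20 class split is oversampled to 100 vs 60; the resampled data (with or without `select_features` applied) can then be fed to any KNN classifier.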
Copyright (c) 2024 I Gede Aris Gunadi, Dewi Oktofa Rachmawati

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.