Similar Questions Identification on Indonesian Language Subject Using Machine Learning

Hasmawati; Ade Romadhony

doi:10.23887/janapati.v12i2.62582

Authors

Hasmawati, Telkom University
Ade Romadhony

DOI:

https://doi.org/10.23887/janapati.v12i2.62582

Keywords:

Question Similarity, Support Vector Machine, IndoBERT, Cosine Similarity, POS Tag

Abstract

Question similarity identification evaluates how similar the questions in a collection are to one another, for example in a question-and-answer forum or on other platforms. It improves the performance of such forums by allowing a new question submitted by a user to be identified as similar to questions that already exist in the database. So far, research on question similarity has been carried out on foreign-language datasets. The purpose of this research is to identify similar questions in a collection of questions written in Indonesian. The methods used are Support Vector Machine (SVM) and IndoBERT. For feature extraction, we evaluate the lexical and syntactic features of each question. For the lexical feature, we use cosine similarity to measure how close two questions are when they are represented as vectors. For the syntactic feature, we use an Indonesian part-of-speech tagger (POS Tag). The dataset is a collection of questions from the Indonesian language subject at the primary and secondary school levels. The results show that the Support Vector Machine performs best with the cosine similarity feature, reaching an accuracy of 85%. Using the POS Tag feature, either alone or combined with cosine similarity, causes the model to overfit, and the accuracy decreases to 77%. The IndoBERT model obtained an accuracy of 95%.
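The lexical feature described in the abstract can be illustrated with a minimal sketch. It assumes lowercase whitespace tokenization and raw term-frequency vectors; the function name `cosine_similarity` and the three sample questions are hypothetical and do not come from the study's dataset, and the preprocessing actually used in the paper may differ.

```python
import math
from collections import Counter


def cosine_similarity(question_a: str, question_b: str) -> float:
    """Cosine similarity between the term-frequency vectors of two questions."""
    # Assumption: lowercase + whitespace split stands in for real preprocessing.
    vec_a = Counter(question_a.lower().split())
    vec_b = Counter(question_b.lower().split())

    # Dot product over the tokens the two questions share.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())

    # Euclidean norms of both vectors.
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))

    # An empty question has no direction; treat it as sharing nothing.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample questions (illustrative only).
q1 = "apa yang dimaksud dengan kalimat utama"
q2 = "apa yang dimaksud dengan gagasan utama"
q3 = "sebutkan ciri ciri puisi lama"

score_similar = cosine_similarity(q1, q2)    # five of six tokens shared
score_different = cosine_similarity(q1, q3)  # no tokens shared
print(score_similar, score_different)
```

A score like this, computed for each question pair, is the kind of single numeric feature that could be passed to a classifier such as an SVM; how the study combines it with the POS Tag feature is not specified in the abstract.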
