SIMILAR QUESTIONS IDENTIFICATION ON INDONESIAN LANGUAGE SUBJECTS USING MACHINE LEARNING

Question similarity identification evaluates how alike the questions in a collection are, for example in question-and-answer forums and on other platforms. It improves the performance of such forums by detecting whether a newly submitted question is similar to a question already stored in the database. To date, research on question similarity has largely been carried out on foreign-language datasets. The purpose of this research is to identify similar questions in a collection of Indonesian-language questions. The methods used are the Support Vector Machine and IndoBERT. For feature extraction, we evaluate the lexical and syntactic features of each question. For lexical feature extraction, we use the cosine similarity algorithm, which calculates the distance between two objects represented as vectors. For syntactic feature extraction, we use an Indonesian part-of-speech tagger (POS Tag). The dataset is a collection of questions from Indonesian-language subjects at the primary and secondary school levels. The results show that the best Support Vector Machine performance is obtained with the cosine similarity feature, with an accuracy of 85%, while using the POS Tag feature alone or combined with cosine similarity causes the model to overfit and the accuracy to decrease to 77%. The IndoBERT model achieves an accuracy of 95%.


INTRODUCTION
In question-and-answer forums such as Stack Overflow, Quora, and Yahoo, newly submitted questions have often already been asked, so the answers given should be the same as the answers to the similar questions already stored in the database. Identifying similar questions is therefore one way to improve the performance of such forums: the goal is to detect whether a new question submitted by a user is similar to an existing question in the database, so that the question-answering system can provide an answer quickly.
Research on question similarity has been developed using various approaches, one of which is machine learning with feature extraction methods tailored to the characteristics of the language being evaluated. One example is the research by Muntaka Al-asa'd et al. [1], who proposed predicting question similarity by extracting morphological, syntactic, semantic, and lexical features from a dataset of Arabic questions. Their approach involves several steps, including preprocessing for Arabic text, feature extraction, and text classification. The dataset is a collection of 4,000 Arabic question pairs. The methods used are Extreme Gradient Boosting (XGB) and feature selection with Random Forest, and performance is evaluated with accuracy, precision, recall, and F1 score. The study classified questions with an accuracy of 78.2%.
Another study was conducted by [2], who implemented a convolutional neural network to measure question similarity in community question-answering systems. They used the SemEval 2016 dataset and experimented with different feature extraction methods. Their results indicate that combining a CNN with external knowledge gives the best results.
Another deep learning approach was taken by [3], [4]. In [4], two methods are combined: Versatile Global Tmax pooling, used to predict the subsequent word in the data collection, and DeepLSTM, used to predict the best answers. The combination of these methods gives good performance.
Other techniques for the question similarity task were proposed by [3], [5]-[8]. In the research by [8], the answers to questions are used as a bridge between two questions: the patterns of the two question-answer pairs are compared to identify similar questions. The datasets used are collections of questions from the Q&A forums CQADupStack and QuoraQP-a. The implementation consists of three modules: a representation-based similarity module that predicts the similarity vectors of two questions; a matching-pattern module that uses a Siamese Network to compare the matching patterns of two questions based on shared answers; and an aggregation module that combines the similarity vectors from the two previous modules. Their experiments show that the proposed model performs well and outperforms previous models.
This research differs from previous work in that earlier studies were carried out on datasets of questions in foreign languages such as Arabic and English [9]-[11], and in the methods used. The purpose of this research is therefore to identify similar questions in a collection of Indonesian-language questions. A further contribution is that we built a labeled dataset of Indonesian question pairs. We use a machine learning approach, including a Support Vector Machine (SVM) and the pre-trained IndoBERT, to predict the similarity of new questions to existing ones. As question features, we evaluate the lexical and syntactic features of each question. Lexical and syntactic features were chosen because previous research [1] obtained good performance with them, and because open-source Indonesian language processing tools that support them are available. The dataset is a collection of questions from Indonesian-language subjects at the elementary and secondary school levels.

METHOD
The flowchart of this research can be seen in Figure 1.

Corpus Construction
The first step in this research is collecting and labeling the dataset. The data collected is a set of questions from Indonesian-language subjects at the primary and secondary school levels. An example of the dataset can be seen in Table 1.
The next step is to label the dataset. Each pair of questions is labeled "yes" if the questions are similar and "no" otherwise. There were 622 question pairs: 407 pairs labeled "yes" and 215 pairs labeled "no". Three undergraduate students carried out the labeling. An example of question pairs labeled by similarity can be seen in Table 2. We did not preprocess the dataset (stopword removal, stemming, etc.), because removing common words from a question would change its context and could therefore affect the similarity prediction.
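As an illustration, the labeled corpus can be represented as records that pair two questions with a "yes"/"no" similarity label. The sketch below uses hypothetical example questions; the actual dataset entries are shown in Tables 1 and 2.

```python
# Hypothetical records mirroring the labeled corpus structure: each entry
# pairs two Indonesian questions with a "yes"/"no" similarity label.
corpus = [
    {"question_1": "Apa yang dimaksud dengan pantun?",
     "question_2": "Jelaskan pengertian pantun!",
     "label": "yes"},
    {"question_1": "Apa yang dimaksud dengan pantun?",
     "question_2": "Sebutkan ciri-ciri kalimat efektif!",
     "label": "no"},
]

# Class membership per label (the full dataset has 407 "yes" and 215 "no").
yes_pairs = [p for p in corpus if p["label"] == "yes"]
no_pairs = [p for p in corpus if p["label"] == "no"]
```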

Feature Extraction
We perform lexical and syntactic feature extraction. To extract the lexical feature, we calculate the cosine similarity between the two questions in a pair using the cosine similarity implementation from scikit-learn. Cosine similarity measures the similarity between question pairs represented as vectors by calculating the cosine of the angle between them. The formula for the cosine similarity of a question pair is shown in Equation 1 [1]:

similarity(A, B) = (A · B) / (‖A‖ ‖B‖)    (1)

where A and B are the vector representations of the two questions. To extract the syntactic features, we use Indonesian POS tagging, adopting the POS Tag extraction stages from the research of Rani Aulia et al. [12], [13]. Table 3 shows a sample of the feature extraction results.
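The lexical feature can be computed directly with scikit-learn. The sketch below assumes a bag-of-words vectorization (the exact representation is not specified above) and uses hypothetical example questions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def question_pair_similarity(q1: str, q2: str) -> float:
    # Vectorize both questions over a shared vocabulary (bag-of-words is an
    # assumption here), then compute the cosine of the angle between them.
    vectors = CountVectorizer().fit_transform([q1, q2])
    return float(cosine_similarity(vectors[0], vectors[1])[0][0])

# Identical questions score 1.0; partially overlapping questions fall
# strictly between 0 and 1.
score = question_pair_similarity(
    "Apa yang dimaksud dengan paragraf deduktif?",
    "Jelaskan pengertian paragraf deduktif!",
)
```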

Question Similarity Model
To build the question similarity identification model, we used two algorithms: the pre-trained IndoBERT model and the Support Vector Machine.
1) Support Vector Machine (SVM) is a supervised algorithm. For text classification, it separates the data into two classes using a decision boundary called a hyperplane. We implement the SVM in Python with the scikit-learn library, using the RBF kernel with regularization parameter C = 1.0 and gamma set to "auto".
2) IndoBERT is a monolingual BERT model for Indonesian. IndoBERT comes in three sizes: IndoBERT-liteBase, IndoBERT-Base, and IndoBERT-Large. In this research, we use the pre-trained IndoBERT-Base p1 proposed by B. Wilie et al. [14] and Fajri et al. [15], which was pre-trained on the Indo4B Indonesian corpus.
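With the hyperparameters reported above, the SVM classifier can be sketched as follows. The one-dimensional cosine-similarity feature matrix and its values are illustrative, not the actual data.

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative features: one cosine-similarity value per question pair.
X = np.array([[0.91], [0.85], [0.78], [0.20], [0.12], [0.05]])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = similar ("yes"), 0 = not similar ("no")

# Hyperparameters as reported: RBF kernel, C = 1.0, gamma = "auto".
clf = SVC(kernel="rbf", C=1.0, gamma="auto")
clf.fit(X, y)

# A new pair with high lexical overlap, and one with low overlap.
predictions = clf.predict([[0.88], [0.10]])
```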

Evaluation
To evaluate model performance, we measure accuracy, precision, recall, and F1 score. We conducted experiments to compare the SVM and IndoBERT algorithms with different feature combinations: 1) syntactic features only (POS Tag), 2) lexical features only (cosine similarity), and 3) a combination of POS Tag and cosine similarity features.
To implement the algorithms, we split the dataset into three parts: 80% for training, 10% for validation, and 10% for testing.
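One way to obtain the 80/10/10 split with scikit-learn is to hold out 20% first and then halve it. The sketch below assumes the 622 labeled pairs; the `random_state` and stratification settings are our own assumptions, not reported in the text.

```python
from sklearn.model_selection import train_test_split

pairs = list(range(622))            # stand-ins for the 622 question pairs
labels = [1] * 407 + [0] * 215      # 407 "yes" and 215 "no" labels

# 80% training, then the remaining 20% split evenly into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    pairs, labels, test_size=0.2, random_state=42, stratify=labels)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest)
```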

RESULT AND DISCUSSION
Following the experiment scenario above, we assessed the performance of the SVM and IndoBERT classification algorithms with the following feature combinations: 1) syntactic features only (POS Tag), 2) lexical features only (cosine similarity), and 3) a combination of POS Tag and cosine similarity features.
To evaluate the classification models, we calculate accuracy, precision, recall, and F1 score from the confusion matrix.
1) Accuracy is the ratio of correct predictions (positive and negative) to the entire dataset; in this case, it is the percentage of questions correctly predicted as similar or not similar out of all questions. Based on Table 4, the highest accuracy on the training data, 88%, is obtained by combining the POS Tag and cosine similarity features; on the test data, however, the accuracy drops to 77%. The same pattern holds when using the POS Tag feature alone: 86% accuracy on the training data, falling to 80% on the test data. The cosine similarity feature gives better results: 81% accuracy on the training data, rising to 85% on the test data. The IndoBERT model obtains 54% accuracy on the training data and 95% on the test data.
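All four metrics can be computed from the confusion matrix with scikit-learn. The sketch below uses illustrative predictions, not the actual model outputs.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Illustrative labels: 1 = similar ("yes"), 0 = not similar ("no").
y_true = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)  # the ratio described in the text
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
```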
From these results, we conclude that on the training data, combining the POS Tag and cosine similarity features improves the algorithm's performance, but on the test data the highest accuracy is obtained using the cosine similarity feature alone. In other words, the model overfits the dataset. Based on our evaluation and observations, this occurs because the data is not varied enough for the complexity of the model: the vocabulary of the Indonesian-language questions at the primary and secondary school levels is still limited.

CONCLUSION
A model has been built to identify similar questions at the primary and secondary school levels using the SVM and IndoBERT algorithms. In the implementation, we extracted the lexical and syntactic features of each question. The experimental results show that the best model performance is obtained using the cosine similarity feature with the SVM algorithm, while using the POS Tag feature alone or combined with cosine similarity causes the model to overfit the dataset and the accuracy to decrease.
To improve model performance in future studies, we propose using other feature extraction methods, such as TF-IDF or a count vectorizer, which focus on the frequency of word occurrences in a document, and evaluating semantic features with various feature extraction approaches. The dataset can also be improved by increasing its size and the variety of words used.