PERFORMANCE COMPARISON OF SUPERVISED LEARNING USING NON-NEURAL NETWORK AND NEURAL NETWORK

Mobile phones and mobile applications based on the Android operating system are developing rapidly. Many start-ups are transforming digitally, using mobile apps to provide disruptive digital services that replace obsolete ones. This transformation prompts attackers to create malicious software (malware) that uses sophisticated methods to target Android phone users. Research on static malware analysis is already saturated: reported accuracy has reached 98%, and many studies report 99%. The challenge taken up in this study is to raise accuracy above 99% while still using the static method. The purpose of this study is to identify Android APK files by classifying them with an Artificial Neural Network (ANN) and with Non-Neural Network (NNN) methods. The ANN is a Multi-Layer Perceptron Classifier (MLPC), while the NNN methods are K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Decision Tree. The study compares the performance of these two groups. The problem with the Non-Neural Network algorithms is that their performance often decreases when they are trained on a larger dataset. To address this decrease, the Multi-Layer Perceptron Classifier (MLPC) is used as the Artificial Neural Network algorithm. Among the Non-Neural Network algorithms, K-Nearest Neighbor achieves 91.2% accuracy when trained on the 600-APK dataset, and its accuracy falls to 88% on the 14170-APK dataset. The Support Vector Machine achieves 99.1% accuracy on the 600-APK dataset and falls to 90.5% on the 14170-APK dataset. The Decision Tree achieves 99.2% accuracy on the 600-APK dataset and falls to 90.8% on the 14170-APK dataset. The Multi-Layer Perceptron Classifier reaches 99% accuracy on the 600-APK dataset, and its accuracy rises to 100% on the 14170-APK dataset.


INTRODUCTION
Malware APKs are increasing along with the number of Android Package Kit (APK) files, the application packages that run on the Android operating system. The large number of Android APKs draws more and more attackers, who exploit them for their own benefit. An Android phone that has been infected with malware is therefore seriously harmed. Because malware keeps growing from year to year, this research takes Android malware as its topic.
Intents are interfaces that connect interactions between Activities in an Android APK. Additionally, Intents send data to other Activities, such as sending data to other applications (Gmail, Google Maps, etc.). In essence, Intents are mechanisms for performing actions and communication between application components.
Originality: Most studies in the literature review focus on the permission feature and rarely explore the intent feature. An Android APK needs intents to trigger actions or activities that call components or send data. Without intents, an Android application cannot perform these actions. Therefore, this research focuses on both the permission feature and the intent feature.
Malware classification has been carried out with machine learning algorithms such as K-Nearest Neighbor, Support Vector Machine, and Decision Tree. Their average classification accuracy is good, but it decreases on large datasets. An experiment was therefore carried out with a deep learning algorithm, namely the Multi-Layer Perceptron Classifier (MLPC). In several experiments, its accuracy continued to increase as the size of the dataset increased.

LITERATURE REVIEW
In this study, we compare our work with previous research on Android malware APKs. Attackers create malware using new methods to target Android phone users. Several studies have used effective tools to make malware detection as accurate as possible. Table 1 shows that much research extracts permissions, system calls, API calls, and network information, but very rarely the intent feature. This research therefore uses the intent feature in addition to the permission feature.
The motivation for this research is that Non-Neural Network algorithms such as KNN, Support Vector Machine, and Decision Tree already perform well, but their accuracy can still be improved with better algorithms. To address this research gap, this study uses a Neural Network algorithm in experiments aimed at better accuracy.

THE STATE OF THE ART
The state of the art in this research is training on a dataset of Permission and Intent features using an Artificial Neural Network; static analysis of the intent feature is rarely done by researchers.
This study aims to compare the accuracy of Non-Neural Network and Artificial Neural Network methods in identifying Android APK files, with the Multi-Layer Perceptron Classifier as the neural network. The main contribution of this paper is to improve on the classification accuracy of Non-Neural Network methods by applying an Artificial Neural Network algorithm, the Multi-Layer Perceptron Classifier (MLPC).

RESEARCH QUESTIONS
Based on this description, the paper addresses several research questions. RQ 1: How can a malware dataset be extracted using the permission feature and the intent feature? RQ 2: What accuracy percentages do the K-NN, Support Vector Machine, and Decision Tree algorithms achieve? RQ 3: What is the percentage increase in accuracy with the implementation of the Multi-Layer Perceptron algorithm? RQ 4: Is it effective to perform malware analysis using static methods? The article is organized as follows. Section 1 is the introduction. Section 2 covers the research methods and presents a literature review of several articles related to the classification of Android malware. Section 3 presents the results of the experiments that have been carried out. Section 4 summarizes the paper.

RESEARCH METHODS
The methodology proposed for this research is as follows. The first stage creates a dataset from Android APK files that are labeled as malware or benign. The malware APK files are downloaded from the University of New Brunswick and are already labeled by malware type.
The downloaded files are kept in local storage, then sorted by class, with files of the same class stored in the same folder.
Next, features are extracted from the Android APK files by reverse engineering. Many reverse engineering tools are in common use; this research uses the JADX module. Reverse engineering yields several folders and files, including AndroidManifest.xml. Files and folders other than AndroidManifest.xml are deleted, while AndroidManifest.xml is parsed to read the permission and intent features. The feature extraction [12] process produces the malware dataset. The next process is classification using machine learning or deep learning algorithms [13].
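The manifest-parsing step can be illustrated with a minimal sketch. It assumes the decompiled AndroidManifest.xml is plain-text XML and reads permissions from uses-permission elements and intents from action elements. The function name extract_features and the sample manifest are hypothetical and are not taken from the study's code.

```python
import xml.etree.ElementTree as ET

# Namespace that prefixes the android:name attribute in AndroidManifest.xml.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"


def extract_features(manifest_xml: str) -> dict:
    """Return the permission names and intent action names declared in a manifest."""
    root = ET.fromstring(manifest_xml)
    # <uses-permission android:name="..."/> elements carry the permissions.
    permissions = {el.get(ANDROID_NS + "name") for el in root.iter("uses-permission")}
    # <action android:name="..."/> elements inside intent filters carry the intents.
    intents = {el.get(ANDROID_NS + "name") for el in root.iter("action")}
    return {
        "permissions": sorted(permissions - {None}),
        "intents": sorted(intents - {None}),
    }


# Hypothetical hand-written manifest, used only for illustration.
SAMPLE_MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.example.app">
    <uses-permission android:name="android.permission.SEND_SMS"/>
    <uses-permission android:name="android.permission.INTERNET"/>
    <application>
        <activity android:name=".MainActivity">
            <intent-filter>
                <action android:name="android.intent.action.MAIN"/>
            </intent-filter>
        </activity>
        <receiver android:name=".BootReceiver">
            <intent-filter>
                <action android:name="android.intent.action.BOOT_COMPLETED"/>
            </intent-filter>
        </receiver>
    </application>
</manifest>"""

features = extract_features(SAMPLE_MANIFEST)
```

Each APK's permission and intent names could then be turned into one row of a binary feature table, with one column per distinct name.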

Pipeline 2: Prepare Training Dataset malware.
Before training on the malware dataset, a preparation stage is necessary. A model produced by machine learning or deep learning training requires a clean dataset, with no null or incorrect values in the features. The malware data must not be mixed with the benign data; if the two are mixed, the resulting model will make errors and its performance will suffer.
In addition to data cleaning, there is feature engineering, namely analyzing the features and identifying the most influential ones. This process must be carried out because it also strongly influences the resulting model.
The next process is to make the dataset uniform: if there are five groups of data, each group must remain unmixed. For example, a dataset of Ransomware APKs should not be mixed with the Riskware APK dataset.
For machine learning, the dataset is commonly divided into 70% training data and 30% testing data, but this ratio is not a requirement; some researchers use 60% training data and 40% testing data. For deep learning, the dataset is divided into training data, validation data, and testing data, for example training data plus validation data = 70% and testing data = 30%.
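As a minimal sketch of the 70%/30% division, the following splits a list of sample identifiers after shuffling them. The function name split_dataset, the fixed seed, and the integer identifiers are assumptions made for illustration, not part of the study.

```python
import random


def split_dataset(samples, train_ratio=0.7, seed=42):
    """Shuffle a copy of the samples, then cut it into training and testing parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical identifiers standing in for the 600 APK samples.
apk_ids = list(range(600))
train, test = split_dataset(apk_ids, train_ratio=0.7)
```

With 600 samples this gives 420 training and 180 testing samples; passing train_ratio=0.6 gives the 60%/40% variant.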
Cross validation, which swaps the roles of the training and testing portions of the dataset, is also carried out to estimate the performance of the model produced by machine learning or deep learning.
Data preparation is done for several reasons:
The data are not ideal and contain missing values. Missing data degrade model performance, so they must be filled in to make the dataset complete. They must not be filled in arbitrarily; the relevant features or dimensions of the dataset must be analyzed first.
The data formats differ. To avoid format differences across features, the dataset must be checked and validated and its features analyzed.
The dataset is small or imbalanced. Small datasets are not ideal for producing machine learning or deep learning models and can invalidate the model. The Synthetic Minority Oversampling Technique (SMOTE) is one way to balance a dataset so that machine learning produces a good model. This study did not use SMOTE because the classes in its datasets were balanced; SMOTE is needed only for imbalanced malware dataset classes.
The dependent and independent variables are unclear or unlabeled.

Pipeline 3: Training and Testing Process.
This stage trains models on the malware dataset using the KNN, Support Vector Machine, and Decision Tree algorithms. The dataset is split into 70% training data and 30% testing data. The Multi-Layer Perceptron Classifier algorithm [14], [15] is also used at this stage. Training is also repeated with the training and testing portions exchanged, which is known as cross validation (CV). This study uses 5-fold cross validation to get a better estimate of model accuracy. In cross validation the data are separated into two subsets, a learning subset and an evaluation subset; the model is trained on the learning subset and validated on the evaluation subset. The type of CV can be chosen based on the size of the dataset. K-fold CV is used because it can reduce computation time while maintaining the accuracy of the estimate. 5-fold CV is one form of K-fold CV used for selecting the best model because it tends to give less biased accuracy estimates. In 5-fold CV, the dataset is divided into 5 folds of approximately equal size. Each fold serves once as the testing data while the other 4 folds are used as training data.
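The fold rotation in 5-fold cross validation can be sketched as an index generator. This is an illustrative sketch only: the function name k_fold_indices is hypothetical, and the samples are assumed to have been shuffled beforehand, since the folds are taken as contiguous blocks.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]                # 1 fold for testing
        train_idx = indices[:start] + indices[start + size:]  # remaining k-1 folds
        yield train_idx, test_idx
        start += size


# 14170 samples, matching the size of the largest dataset in this study.
folds = list(k_fold_indices(14170, k=5))
```

For 14170 samples each round tests on 2834 samples and trains on the other 11336; the reported score would be the average over the five rounds.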

Pipeline 4: Prepare New APK data to be tested
This stage adds new data to the dataset. If new malware variants are found during classification, their features must be extracted before they are added to the dataset, and the model is then retrained. The larger the dataset, the better the classification model identifies malware APKs.

Pipeline 5: Decision Classification Output Label.
The last stage produces a classification model that is ready for deployment. The model is tested before use to anticipate errors in identifying Android APK files.
In this section, the researcher discusses malware analysis and classification [

Static Analysis
Static analysis [23] is a malware analysis method that analyzes source code. Reverse engineering is used to obtain the source code by converting the executable file into source code files. To analyze a malware APK file, for example, the APK file must first be reverse engineered. Static analysis does not need to run the application.
Reverse engineering is done with the JADX module from APKTOOL. The source code to be analyzed is the AndroidManifest.xml file, which is parsed to read the android-permission and android-intent entries. Reverse engineering serves several purposes:
To learn the protocol of a program, for example to create a command-line Instagram client.
To find out the API used by a program, for example to learn how to turn on the camera flash as a flashlight.
To find security bugs in a program.
To find out whether a program violates copyright, for example when we suspect that a program uses a commercial library that we created without paying for a license.
For forensic purposes, for example to learn the data format used by a program.

Dynamic Analysis
Malware is a threat to Android, and various methods are used to analyze it, one of which is dynamic analysis. Dynamic analysis of Android malware aims to understand its behavior and improve the ability to detect it. The analysis runs the malware code in a virtual environment to observe the actual behavior of the malware.
The dynamic analysis method does not examine the source code but runs the malware files in a controlled environment called a sandbox. The behavior of the malware can thus be analyzed without the malware spreading to other systems. Observing the malware produces a log of its activity, and this log is then analyzed.

Hybrid Analysis
Hybrid malware analysis is a combination of static analysis and dynamic analysis: it runs the malware in a controlled environment and also analyzes the source code. Hybrid analysis is therefore the most complete way to analyze a piece of malware.
Euclidean distance [28] is a formula for the straight-line distance between two points, for example in two-dimensional space. Hamming distance [29] measures the distance between two binary vectors of equal length as the number of positions in which they differ. Manhattan distance [30] is a formula for the distance d between two vectors in n-dimensional space, computed as the sum of the absolute differences of their coordinates. Minkowski distance is a formula for the distance between two points in a normed vector space that generalizes both the Euclidean distance and the Manhattan distance.
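The four distance measures named above can be written as short functions. This is a minimal sketch using the standard definitions; the function names and sample vectors are chosen for illustration only.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def euclidean(a, b):
    """Straight-line distance between two points."""
    return minkowski(a, b, 2)


def manhattan(a, b):
    """Sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


d_euclid = euclidean((0, 0), (3, 4))
d_manhattan = manhattan((0, 0), (3, 4))
d_hamming = hamming([1, 0, 1, 1], [1, 1, 0, 1])
```

For binary permission and intent vectors, the Hamming distance simply counts the features on which two APKs disagree.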

K-Nearest Neighbor
The K-Nearest Neighbor (KNN) [31], [32] algorithm classifies an object based on the learning data closest to it. The value of K is determined first, and it is chosen to be odd; a vote is then taken among the K closest data points.
The advantages of the Support Vector Machine (SVM) are that, as a supervised method, it can control the accuracy of classification, and that the kernel trick allows it to classify nonlinear data. Its disadvantages are that it does not handle large amounts of data well and that the kernel trick is not easy to implement.
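The K-Nearest Neighbor procedure described in this section (choose an odd K, then vote among the closest training samples) can be sketched as follows. The Euclidean distance, the function name knn_predict, and the tiny binary feature vectors are assumptions made for illustration and do not reproduce the study's data.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Label a query by majority vote among its k nearest training samples."""
    # Pair every training sample's distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Vote among the k closest samples; an odd k avoids ties between two classes.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical binary feature vectors (1 = the permission or intent is present).
train_X = [(1, 1, 0), (1, 0, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1)]
train_y = ["malware", "malware", "benign", "benign", "malware"]

label_a = knn_predict(train_X, train_y, (1, 1, 0), k=3)
label_b = knn_predict(train_X, train_y, (0, 0, 1), k=3)
```

Because every prediction compares the query against all training samples, the cost of KNN grows with the size of the dataset.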

Decision Tree
A decision tree (DT) [45] converts data into a tree of decisions and decision rules. The benefit of DT is its ability to break down a complex decision-making process into simpler ones, so that decision-makers can better interpret the solution to a problem. For example, the tree asks whether the Android APK file is Banking malware; if true, the file is classified as Banking malware. Otherwise it keeps asking about the next type, Ransomware, Riskware, and SMS malware, down to Benign APK.
Building a Decision Tree model [46], [47], [48] is like drawing an inverted tree with the Root Node at the top. An Internal Node has 1 input and at least 2 outputs. A Leaf Node is a final node that has 1 input and no output.
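The root, internal, and leaf nodes described above can be represented with a small data structure. This sketch uses a hand-built binary tree with hypothetical feature names and labels; a real decision tree is learned from the training data rather than written by hand.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    feature: Optional[str] = None  # question asked at a root or internal node
    yes: Optional["Node"] = None   # branch taken when the feature is present
    no: Optional["Node"] = None    # branch taken when the feature is absent
    label: Optional[str] = None    # set only on leaf nodes


def classify(node: Node, sample: dict) -> str:
    """Walk from the root down to a leaf and return the leaf's label."""
    while node.label is None:
        node = node.yes if sample.get(node.feature, 0) else node.no
    return node.label


# Hypothetical tree: the root asks one question, one internal node asks another.
tree = Node(
    feature="SEND_SMS",
    yes=Node(label="SMS malware"),
    no=Node(
        feature="BOOT_COMPLETED",
        yes=Node(label="Ransomware"),
        no=Node(label="Benign"),
    ),
)

result = classify(tree, {"SEND_SMS": 1})
```

Each path from the root to a leaf corresponds to one decision rule.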

Multi-Layer Perceptron Classifier
Multi-Layer Perceptron [49], [50] is a classification algorithm that uses a deep neural network. It differs considerably from machine learning algorithms based on statistics.
With a deep neural network, the model is expected to be more accurate than models produced by classical machine learning.
[Figure: architecture of the Multi-Layer Perceptron Classifier]
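The architecture figure is not reproduced here. As a rough illustration of how a multi-layer perceptron turns a feature vector into a prediction, the sketch below runs one forward pass through a single hidden layer. The layer sizes, the weights, and the ReLU and sigmoid activations are hypothetical and are not the configuration used in this study.

```python
import math


def relu(values):
    """Rectified linear activation applied element-wise."""
    return [max(0.0, v) for v in values]


def sigmoid(x):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """Fully connected layer; weights[j] holds the weights of output unit j."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def mlp_forward(x, params):
    """Input layer -> one hidden layer (ReLU) -> one output unit (sigmoid)."""
    hidden = relu(dense(x, params["W1"], params["b1"]))
    logit = dense(hidden, params["W2"], params["b2"])[0]
    return sigmoid(logit)


# Hypothetical fixed weights: 3 input features, 2 hidden units, 1 output.
PARAMS = {
    "W1": [[1.0, 1.0, 0.0], [0.0, -1.0, 1.0]],
    "b1": [0.0, 0.0],
    "W2": [[2.0, -2.0]],
    "b2": [-1.0],
}

p_malware = mlp_forward([1, 1, 0], PARAMS)
```

In training, the weights would be adjusted by backpropagation so that the output probability matches the malware or benign label.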

RESULTS AND DISCUSSION
The experiments were run on a MacBook Air 2020 with 8 GB RAM and 256 GB storage, using the Python programming language in Jupyter Notebook and the JADX reverse engineering module made by APKTOOL. This section answers the research questions and reports the experimental results.
RQ 1: How can a malware dataset be extracted using the permission feature and the intent feature? This is an essential step, because it generates the malware dataset. APK files are downloaded and extracted, reverse engineered, and parsed to read the permission and intent features. The final result of feature extraction is a malware dataset. Following are the permission features of the malware dataset:
Accuracy is the ratio of correct predictions (positive and negative) to the entire dataset. It answers the question "What percentage of Android APK files were correctly predicted as Malware or Benign out of the entire dataset of Android APK files?" Accuracy = (TP + TN) / (TP + FP + FN + TN) [51]. Accuracy can be seen in table 4.
F1 Score is the harmonic mean of precision and recall. F1 Score = 2 * (Recall * Precision) / (Recall + Precision) [51]. F1-Score can be seen in table 6.
Recall is the ratio of true positive predictions to the total number of actual positive data. It answers the question "What percentage of Android APK files are predicted to be malware compared to all APK files that are actually malware?" Recall = TP / (TP + FN) [51]. Recall can be seen in table 7.
There is a decrease in performance for the models generated by the K-Nearest Neighbor, Support Vector Machine, and Decision Tree algorithms.
RQ 3: What is the percentage increase in accuracy with the implementation of the Multi-Layer Perceptron algorithm? The Multi-Layer Perceptron Classifier (MLPC) was trained on datasets of 600 APKs, 7000 APKs, and 14170 APKs. Training was carried out with these three datasets, and the resulting graph can be seen in Figure 6.
RQ 4: Is it effective to perform malware analysis using static methods? The static method does not require running the malware in an isolated or controlled environment. The malware APK file is only extracted, and its features are stored in the malware dataset. The dataset is classified using the classification method, and the model is then tested with the extracted malware dataset. The results show that the method is effective at detecting whether an Android APK file is infected with malware or is normal. The static method is simple and works effectively for malware detection.
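The accuracy, recall, and F1 formulas quoted in this section can be checked with a short function that starts from the four confusion-matrix counts. Precision is taken as TP / (TP + FP), the standard definition. The counts below are hypothetical and are not results from this study, and the sketch does not guard against zero denominators.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (recall * precision) / (recall + precision)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for a test set of 200 APK files.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
```

With these counts the accuracy is 185/200 = 0.925, the recall is 90/95, and the F1 score is 180/195.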

CONCLUSION
Based on the experiments conducted in this study, classification using machine learning with the K-Nearest Neighbor, Support Vector Machine, and Decision Tree algorithms produces good accuracy. However, the use of larger datasets leads to a decrease in accuracy. Applying an Artificial Neural Network, the Multi-Layer Perceptron Classifier, answers this problem of the Non-Neural Network methods. Training Non-Neural Network (NNN) methods on the large dataset still produces fairly high accuracy: on the 14170-APK dataset, K-Nearest Neighbor averages 88%, Support Vector Machine averages 90.5%, and Decision Tree averages 90.8%. The Artificial Neural Network (ANN) with Multi-Layer Perceptron reaches 100% accuracy on the same 14170-APK dataset. This high accuracy yields a good model for identifying Android malware APK files, so it is well suited to use in an identification application.

SUGGESTION
Research can be continued with dynamic and hybrid methods. Malware can be studied more thoroughly with dynamic methods, because they reveal the behavior of the malware. The results of dynamic research can be used to group malware, for example malware that carries out attacks on specific targets. The hybrid method combines static and dynamic methods, which makes the malware analysis as a whole more complete.
A. Strzelecka and D. Zawadzka, "Application of classification and regression tree (CRT) analysis to identify the agricultural households at risk of financial exclusion," Procedia Comput.