Ransomware poses a significant threat to Android devices, presenting a pressing concern in the realm of malware. While there has been extensive research on malware detection, distinguishing between various malware categories remains a challenge. Notably, ransomware often disguises its behavior to resemble less harmful forms of malware like adware, evading conventional security measures. Therefore, there is a critical need for advanced malware category detection techniques to elucidate specific behaviors unique to each malware type and bolster detection efficacy. This paper aims to enhance Android ransomware detection by investigating the optimal combination of static features (such as permissions, intents, and API calls) and dynamic features (captured from network traffic flow) that contributes to minimizing false negatives when applying supervised machine learning classification. Our research also aims to discern the pivotal features essential for accurate ransomware detection. To this end, we propose a model integrating feature selection techniques and employing various machine learning classifiers, including decision trees, k-nearest neighbors, random forest, gradient boosting, and bagging. The model was implemented in Python, and its evaluation was conducted with and without k-fold cross-validation to offer a broader range of explored behaviors. Our findings highlight the efficacy of combining network-permission and network-API features, which exhibit superior ransomware detection rates compared to other feature combinations. Moreover, our model achieved recall scores of 99.2% and 97% before and after employing cross-validation, respectively. We also identified 6 API features, 27 network features, and 18 permission features as the most useful ones for ransomware detection in Android.
INTRODUCTION
The Android operating system has become a dominant force in the mobile industry due to its open-source nature, comprehensive documentation, and robust (both official and unofficial) support community [1, 2]. Nevertheless, this same popularity has also made it a prime target for malicious actors, given its vast user base, extensive reach, as well as the presence of repositories with varying levels of security (e.g., Appchina.com), which serve as fertile grounds for hackers to deploy malware [2, 3]. Among the plethora of malware threats targeting Android devices, ransomware stands out as particularly insidious, employing sophisticated packaging and encryption techniques to conceal its presence. Ransomware can perform a broad range of malicious activities such as device locking, file encryption, system configuration alterations, unauthorized access to sensitive areas, and breaches of user privacy. This is illustrated by Kaspersky’s 2022 annual report, which reported 3821 instances of successfully installed mobile ransomware [4]. Other similar reports emphasize the severity of cyber-attacks due to malware in industry in general [5, 6]. Such reports demonstrate the need for more robust security measures and proactive defense strategies to safeguard Android users against evolving malware threats.
There are various methods for ransomware detection, including static, dynamic, and hybrid analysis [7, 8]. Static analysis involves scrutinizing the code, while dynamic analysis observes the behavior of malware during execution, such as monitoring network traffic. However, static analysis may prove ineffective due to sophisticated packaging and obfuscation techniques employed by ransomware authors. Similarly, dynamic analysis may lack comprehensiveness due to the challenges associated with extracting dynamic features accurately [9, 10]. Given these limitations, many researchers have turned to hybrid techniques which combine both static and dynamic detection methods to compensate for their individual weaknesses. This research work follows a similar strategy and aims to address the following research question: Which combination of static and dynamic features can yield an optimal ransomware category detection, distinguishing ransomware from other malware types?
This research work utilizes the publicly available dataset CIC-AndMal2017 [9] which has been developed by the Canadian Institute for Cybersecurity. This dataset was chosen because it is composed of data collected from real-world Android applications installed on authentic devices in genuine environments, from which dynamic features were extracted. In terms of evaluation metrics, the recall score was considered as the main one due to the serious consequences that false negatives might have in the ransomware detection field. That is, ransomware incorrectly identified as a benign application, or misclassified as a less harmful form of malware (e.g., adware), poses a grave risk to users. For instance, a single undetected instance of ransomware could potentially lock an entire device, or compromise the whole infrastructure of an organization.
The main contribution of this paper is the design and experimental validation of a model that accepts various combinations of static or dynamic features. More specifically, this work presents a model that leverages a set of 5 well-established classification machine learning algorithms [11] (i.e., decision trees (DT), k-nearest neighbors (KNN), random forest (RF), gradient boosting, and bagging). In terms of results, the best recall score was 99.2% and was obtained by the gradient boosting classifier with the network-permission and network-API combinations, which proved to be the best feature combinations. Additionally, our analysis identified a set of 51 relevant features (6 for API calls, 27 for network flow, and 18 for permissions) among all the features extracted from the dataset used (where the universe of unique attributes, i.e., features, comprised 281, 114, 528, and 77 features for permissions, API calls, intents, and network, respectively).
The rest of this paper is organized as follows: Section 2 presents the related work. Section 3 presents the research methodology used. Section 4 describes the developed model and prototype. Section 5 discusses the evaluation performed. Finally, Section 6 concludes this work and presents some pointers to future work.
RELATED WORK
Numerous research efforts have sought to improve the ransomware detection process. Next, we discuss the most relevant state-of-the-art literature related to our work.
The authors of [9] proposed a systematic approach to generate a comprehensive dataset for Android malware detection called CIC-AndMal2017. There, the authors introduce a dataset of about 5000 benign samples and 426 malware samples across 4 different categories (i.e., adware, ransomware, SMS malware, and scareware). To create it, the authors installed all the sample applications on real Android smartphones, in a real environment, and captured 6 different sets of features during three stages: (1) during installation, (2) 15 min before restarting the device, and (3) 15 min after restarting the device. The captured features are related to network traffic, memory usage, system logs, permissions, API calls, and phone statistics. Furthermore, the authors only used the network traffic features for malware category detection. They processed the captured traffic with CICFlowMeter (a publicly available network-flow feature extraction tool). Using this tool, they were able to extract 80 different network traffic features. In that same work, the authors used the CfsSubsetEval and InfoGain feature selection methods in Weka (a Java-based machine learning framework [12]). By relying exclusively on network traffic features, the authors identified 9 important features for malware detection and, by using the DT, KNN, and RF classifiers, they achieved a recall score of 85% for malware detection and 49.5% for malware category detection. The biggest strength of this study is the usage of real data. On the contrary, using only network features without considering static features is its most noticeable area of opportunity. Also, a recall score of 49.5% is relatively low for ransomware detection (as it is lower than random guessing).
Meanwhile, the work in [10] conducted a two-layer analysis (i.e., static and dynamic, sequentially) on Android malware detection. There, the authors used CIC-AndMal2017 [9] as the main dataset. The static analysis performed consists of permission and intent features extracted from AndroidManifest.XML (comprising 8115 different features). The second layer of analysis utilizes dynamic features (i.e., the network traffic features derived in [13], as well as dynamic API calls). For the dynamic API calls, the authors used 2-gram sequences and captured API calls before and after a restart and, by using Python natural language processing tools (i.e., NLTK [14]), the authors created a dataset of API calls for each sample. Then an aggregation of network flow features (i.e., the average value of each column of each sample) was appended to the API call dataset. The framework proposed in this work could reach a recall score of 95.3% for malware detection and 83.3% for category detection using the Random Forest classifier. The main strength of this work is the usage of more features (in comparison to [9]). More specifically, they used two types of static features (i.e., permissions and intents) and two dynamic ones (i.e., network traffic and API calls). Similarly, the overall recall score significantly improved. However, the authors only used the Action section of intents (while, in the manifest file, intents include Action, Receiver, and Service). Additionally, they did not feed the classifier with all the features simultaneously. Finally, in this research, the authors did not suggest any set of features important for malware/ransomware detection.
Moreover, the authors of [15] introduced a novel hybrid approach for malware detection. Although their work did not focus on category detection, their approach inspired some ideas used in our work. In that work, two static features (i.e., permissions and API calls) and a dynamic feature (i.e., system calls) are used, and their results reported that there is a high interdependency between permissions, API calls, and system calls, which might cause multicollinearity problems. This means that features might not be distinguishable when they are intensely related to each other. Their approach includes two phases: The first one uses three ridge-regularized logistic regression (LR) classifiers, one for each feature type, while the second one uses a tree-augmented naive Bayes (TAN) algorithm to correlate all outputs for malware detection. The main strength of this work is considering both static and dynamic features, achieving a precision rate of 97%, as well as giving a hint of the effect of multicollinearity on machine learning algorithms. Also, this work presented a list of permission, API call, and system call features relevant for malware detection. However, they used a relatively old dataset and their extracted features did not come from real machines. In our work, because the chosen Random Forest and Gradient Boosting algorithms are fundamentally based on decision trees, they do not suffer from multicollinearity. Also, inspired by this work, we have fed the classifiers with all features simultaneously.
The work presented in [16] introduces an approach for detecting ransomware based on a deep learning model using long short-term memory (LSTM [17]), which uses the CIC-AndMal2017 dataset. In this work, the authors applied 8 different feature selection methods (e.g., Chi-Square, GainRatio attribute evaluation, and CVAttribute evaluation) in order to reduce the processing time. Additionally, this work only used the network traffic features of CIC-AndMal2017 and, after applying the feature selection, they chose 20 network traffic features to feed the LSTM model. The best recall and accuracy scores achieved by this study were 97%. A particular strength of this work is the use of different feature selection methods to ensure that the highest-voted features participated in the analysis process. However, their results rely only on the network features and do not consider other static or dynamic ones.
Meanwhile, the paper [18] introduces a ransomware detection approach that relies exclusively on permissions. The approach starts by extracting all permissions from benign and ransomware APKs, then it applies a series of frequency analyses on the permission requests. In terms of features, this work introduces 23 permissions that have never been requested by ransomware (e.g., ACCESS_NOTIFICATION_POLICY or BIND_AUTOFILL_SERVICE). Also, the authors list the top 10 permissions typically requested by ransomware (e.g., READ_PHONE_STATE or WRITE_EXTERNAL_STORAGE). Finally, the authors used the RF, DT, SMO, and naive Bayes algorithms, and the highest accuracy score reported in their work was 96.9% (obtained by RF). The main strength of this work is presenting the list of permissions most commonly requested by ransomware, as well as the list of permissions which have never been requested.
The authors of [19] present a ransomware category detection approach that relies fully on API information (as ransomware heavily resorts to API calls in order to execute its malicious activities). The model proposed includes preprocessing (i.e., extracting DexCode), feature extraction (i.e., extracting packages, classes, and methods), and classification with Random Forest. The authors prepared their dataset using VirusTotal and achieved accuracy scores of 99% at the package level and 100% at the method level. The obtained results show that the more fine-grained the extracted data is, the higher the accuracy rate that can be achieved. The main strength of this work is listing the most important API calls for ransomware at three levels (packages, classes, and methods), and how they could achieve an accuracy of 100% in certain cases. However, under many obfuscation and packing techniques, their model is significantly less accurate (e.g., the authors reported 90% accuracy in those scenarios).
Moreover, the authors of [20] presented a model of Android ransomware detection based on system calls and using the RF, J48, and naive Bayes classifiers. This work used its own dataset, created by leveraging the VirusTotal dataset for malware samples and Google Play for benign ones (400 for each type), as well as the Genymotion Android emulator. The results of this work reported a best accuracy score of 98.31% for Random Forest, and the authors also released a list of the system calls most used by ransomware. The main strength of this research work is the list of system calls most used by ransomware; however, the authors did not present enough information about their dataset, nor about how the emulator was used in their work.
Finally, the work discussed in [21] presents a model of Android malware detection exclusively based on permissions. There, the authors use a dataset from the Middle East Technical University (METU, in Turkey) which includes about 2300 benign and 1400 malware samples to compare two feature selection categories (i.e., attribute-based and subset-based). This is because, while the former category ignores interdependencies and analyzes every feature individually (e.g., GainRatio, ReliefF), the latter one creates random subsets and chooses the best subset that represents the entire dataset (e.g., CFS, consistency subset evaluator). Furthermore, this work aims to integrate feature selection with different classification algorithms such as RF, J48, SMO, CART, and Bayesian networks, and apply them to the dataset of permissions. Their results show that the integration of CFS and random forest achieved the best accuracy rate. The main strengths of this work are the analysis of different feature selection methods, and the list of important permission-related features.
Summary: The authors of [9] developed CIC-AndMal2017, a comprehensive dataset for Android malware detection featuring 5000 benign and 426 malware samples across four malware categories, collected by capturing six sets of features during specific stages on real Android devices. They focused on network traffic features for malware category detection, achieving a recall score of 85% for malware detection and 49.5% for category detection. Subsequent studies expanded upon this dataset and methodology, incorporating additional static and dynamic features like permissions, API calls, and system calls, leading to improved detection accuracy. Various research works have also explored different feature selection methods and machine learning classifiers, with results highlighting the importance of both static and dynamic features, though limitations (e.g., reliance on network features alone and multicollinearity) have also been noted. Furthermore, each study has emphasized different strengths (e.g., real data usage, feature diversity, high accuracy), while also identifying areas for further improvement in the field.
RESEARCH METHODOLOGY
In our work, we tailored the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology [22] to guide our research process from data collection to model development. We began by identifying a proper dataset; then we selected a set of effective classifiers and feature selection techniques, and defined the required data preprocessing approach and evaluation metrics. Finally, we designed our model and evaluated its performance. The following sections describe the performed CRISP-DM tasks in more detail.
Dataset Selection
Previous research works have collected a comprehensive list of desirable characteristics for datasets to be suitable for malware detection [9, 10]. Based on it, we used the following criteria to identify the appropriate dataset for our work. A dataset should: (1) have consistency in labeling, (2) offer both static and dynamic features, (3) have variety in terms of samples and categories, (4) include up-to-date samples gathered from the real world, and (5) have proper documentation and support.
Based on these criteria, we analyzed the following 6 datasets (selected from the literature, where they have been previously used for similar use cases): Drebin Android Malware [23], Androzoo [24], Android PRAGuard [25], Virusshare [26], CIC-InvesAndMal2019 [10], and CIC-AndMal2017 [9]. All these datasets are well known, have supporting documentation and APIs for downloading APKs, and are varied and large enough. However, Drebin is relatively old, and it does not offer dynamic features. AndroZoo only includes APKs and requires a simulator (or an emulator) to extract dynamic features. Therefore, we discarded them. Like Drebin, the Android PRAGuard dataset is mainly suitable for analyzing different obfuscation techniques and code-based analysis. Thus, we did not consider it suitable for the goal of this project. Meanwhile, the Virusshare dataset lacks dynamic features, as well as solid documentation. Finally, we selected CIC-AndMal2017 [9] because it was the best-ranked dataset as per the above criteria. This dataset has been developed by the Canadian Institute for Cybersecurity, and includes 426 malware APK files and their corresponding network traffic CSV files.
Classifiers Selection
Choosing an appropriate set of classification algorithms was a key aspect of this research work. This is because having variety in classifiers can yield a better understanding as well as a more solid evaluation process. To this end, we chose a set of well-established classifiers which have been used previously in the literature for similar research problems (namely, decision trees, k-nearest neighbors, and random forest) [9, 13]. Additionally, we incorporated the Gradient Boosting and Bagging techniques, in order to have more diversity in the algorithms evaluated. This design decision (i.e., the inclusion of a diverse set of ML algorithms) was taken with the aim of strengthening our evaluation and reducing the likelihood of experiencing common problems, such as data noise or missing data, which can significantly impact the performance of ensemble ML algorithms [27, 28].
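For illustration, the following minimal sketch shows how this set of classifiers can be instantiated with scikit-learn; the settings shown are simply the library defaults and do not reflect any tuning performed in this work.

```python
# Minimal sketch: instantiating the five classifiers evaluated in this work
# with scikit-learn. Default settings only; no tuning is implied here.
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)

classifiers = {
    "DT": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(random_state=42),
    "GB": GradientBoostingClassifier(random_state=42),
    "Bagging": BaggingClassifier(random_state=42),
}
```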
Data Preprocessing
As conducting the appropriate data preprocessing tasks can have a significant impact on the training, we decided to test different preprocessing methods before feeding the classifiers. In this research, the sklearn.preprocessing package [28] is used. More specifically, we use the scaling methods Scale (a function to standardize a dataset based on the mean and standard deviation values) and MinMaxScaler (a function to transform the values to a specific range). We also use PCA (principal component analysis) [29] to reduce the number of features by focusing on the principal components.
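The sketch below illustrates how these three preprocessing options can be applied with scikit-learn; it is a minimal example over a synthetic placeholder matrix, not a fragment of our actual scripts.

```python
# Minimal sketch of the preprocessing options discussed above.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, scale

X = np.random.rand(100, 20)  # placeholder feature matrix (samples x features)

X_standardized = scale(X)                   # zero mean, unit variance per column
X_minmax = MinMaxScaler().fit_transform(X)  # rescale every column to [0, 1]

# Optional: keep the principal components explaining 95% of the variance.
X_reduced = PCA(n_components=0.95).fit_transform(X_standardized)
```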
Evaluation Setup
Regarding evaluation metrics, we have selected the following standard ML metrics [30]: (1) Recall score: This metric reflects the proportion of actual ransomware samples that are correctly identified, and thus penalizes false negatives (i.e., ransomware wrongly labeled as not ransomware). It is our main metric because, in the ransomware domain, any undetected sample could have major impacts on the affected IT infrastructure, such as locking the entire device or encrypting all the user data. (2) Accuracy score: It is considered a complementary, yet relevant, metric in our evaluation and analyses. This metric shows the ratio of the number of correct predictions with respect to the total number of predictions. Additionally, our evaluation uses k-fold cross-validation [31], as it helps to address overfitting problems and, in general, to better assess the generalization of a model, that is, how stable and reliable the model remains when it is faced with new, previously unseen data.
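As a minimal sketch of how these metrics and k-fold cross-validation can be combined with scikit-learn (the data below is synthetic and the classifier choice is merely illustrative):

```python
# Minimal sketch: 10-fold cross-validation reporting recall and accuracy.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate

X = np.random.rand(200, 30)            # placeholder features
y = np.random.randint(0, 2, size=200)  # 1 = ransomware, 0 = non-ransomware

scores = cross_validate(GradientBoostingClassifier(random_state=42), X, y,
                        cv=10, scoring=("recall", "accuracy"))
print("mean recall:  ", scores["test_recall"].mean())
print("mean accuracy:", scores["test_accuracy"].mean())
```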
PROPOSED MODEL
The central concept behind our model revolves around leveraging a combination of static and dynamic features so as to significantly diversify the possible combinations of features used to train the different classifiers, and then evaluating their performance in detecting ransomware in Android in order to determine the best performer(s), as well as the most significant features for this research problem. The development of our model entails two key phases: (1) Pre-Processing: This initial phase involves a series of sequential Python scripts to facilitate static feature extraction from APKs, segregation of features, creation of vector matrices, calculation of network feature means, as well as some formatting (as CSV files) in preparation for the usage of this information in the second phase. (2) Main Analysis: This phase involves the usage of three Python scripts to configure the model settings, conduct the corresponding feature selection, and provide the input data used to train and test the different classifiers. These scripts also orchestrate the evaluation of the model performance. The following paragraphs describe these phases in more detail.
Preprocessing Phase
Conducting preprocessing is essential before training a machine learning classifier. This is because appropriate preprocessing ensures that the data used for training is well-prepared, facilitating the development of more reliable machine learning models. In our work, the preprocessing phase involves the following 4 steps (depicted in Fig. 1):
Fig. 1. [Images not available. See PDF.]
Preprocessing phase: steps involved.
Step 1: This step involves extracting static features from APKs through the Androguard library (a minimal sketch of this extraction is provided after Step 4 below). The folder names are used as the class labels and, if any feature is null, the entire sample is removed. Finally, a CSV file with three columns is generated.
Step 2: This step (which uses the CSV file generated in Step 1 as its main input) separates the permission, intent, and API features into three datasets. The number of columns in these vector matrices is equal to the number of all unique extracted features and, if a sample has a specific feature, the corresponding column is set to one (otherwise, it is zero). Finally, header files are created for use during feature selection, as this information helps to more easily distinguish the important features.
Step 3: This step involves calculating the mean value of the features (per input file), so that each CSV file becomes one entry in a final (i.e., consolidated) CSV file. The number of CSV files for non-ransomware and ransomware must be equivalent to the number of records existing in the Pure_Extracted_features file (generated in Step 1).
Step 4: Finally, this step includes removing the indexes and any additional (unnecessary) data from the CSV files in order to make them more suitable for feeding the classification algorithms.
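To make Step 1 more concrete, the following is a minimal sketch (not the authors' exact Static FeatureExtraction.py script) of how permissions and intent filters can be extracted with Androguard; the folder layout mirrors the one described later in the Prototype subsection, and API-call extraction through the dx analysis object is omitted for brevity.

```python
# Minimal sketch of static feature extraction with Androguard (assumed installed).
import os

from androguard.misc import AnalyzeAPK

samples = {}
for label in ("ransomware", "nonransomware"):         # folder name = class label
    for apk_name in os.listdir(label):
        a, d, dx = AnalyzeAPK(os.path.join(label, apk_name))
        permissions = a.get_permissions()              # from AndroidManifest.XML
        intents = {}
        for kind, items in (("activity", a.get_activities()),
                            ("receiver", a.get_receivers()),
                            ("service", a.get_services())):
            for item in items:
                intents[item] = a.get_intent_filters(kind, item)
        samples[apk_name] = {"class": label,
                             "permissions": permissions,
                             "intents": intents}
```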
Main Analysis Phase
This phase is composed of three steps (depicted in Fig. 2, each one encapsulated in a different Python script). Step 1 (i.e., Main Model.py) accepts the configuration for scaling and PCA and, more importantly, it indicates which input files should be sent to the classifiers. By using sklearn, the dataset is split into training (80%) and testing (20%) following the Pareto Principle. Step 2 (i.e., Feature Selection.py) uses the sklearn feature selection method and creates a list of the selected features based on the fitted classifier model. Finally, Step 3 (i.e., Classification Algorithms.py) involves the creation, training, and evaluation of the classifiers, including saving their outputs. Overall, this model utilizes 7 different concatenations of static and dynamic features, 2 scaling methods, and 2 settings of the PCA flag. Thus, a total of 28 different experiments (i.e., 7 × 2 × 2) were required to cover all experimental configurations.
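A minimal sketch of this flow is shown below; the flag names and data are illustrative placeholders rather than the actual contents of Main Model.py.

```python
# Minimal sketch: configuration flags, 80/20 split, training, and scoring.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, scale

X = np.random.rand(400, 50)          # placeholder for one concatenated feature set
y = np.random.randint(0, 2, 400)     # 1 = ransomware, 0 = non-ransomware

SCALING, USE_PCA = "minmax", False   # illustrative configuration flags
X = MinMaxScaler().fit_transform(X) if SCALING == "minmax" else scale(X)
if USE_PCA:
    X = PCA(n_components=0.95).fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
clf = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("recall:", recall_score(y_test, y_pred),
      "accuracy:", accuracy_score(y_test, y_pred))
```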
Fig. 2. [Images not available. See PDF.]
Main analysis phase: steps involved.
Prototype
The model was implemented in Python (version 3.8.3) using the Anaconda distribution (version 1.10.0), the sklearn machine learning library (version 1.3), and the Androguard package (version 3.4.1a). Fig. 3 shows a summary of the main configurations used by the prototype. The following paragraphs describe the most relevant technical aspects related to the two model phases (i.e., preprocessing and analysis):
Fig. 3. [Images not available. See PDF.]
Summary of prototype settings.
– Preprocessing – Feature Selection: This work uses the SelectFromModel module from sklearn (a minimal sketch is provided at the end of this list). It is worth noticing that this function only works for algorithms whose estimators implement either the coef_ or the feature_importances_ attribute. Thus, as the bagging classifier lacks these attributes, the entire dataset is used in this scenario. Furthermore, any features whose importance values fall below the threshold of this function were removed from the dataset.
– Preprocessing – Static Feature Extraction: Two folders are created (i.e., nonransomware and ransomware) to organize the samples and their features. The nonransomware folder contains 325 APKs of different families of Adware, SMS malware, and Scareware. The ransomware folder contains 101 samples of different families. A Python script (i.e., Static FeatureExtraction.py) is used, which performs two pre-processing loops. The first one executes based on the number of target folders and the second one executes based on the number of APKs in those folders. This script creates a data frame within which each row contains the sample’s name, and each static feature (i.e., permission, intent, and API call) is saved in a new column. Then the folder’s name is used as the label for the class column. Next, the corresponding method from the Analyze package extracts the intents and permissions from the AndroidManifest.XML file. Three nested loops extract the information related to the intents in the activity, receiver, and service entries and save them in a dictionary. Moreover, for the same sample, the permissions listed in AndroidManifest.XML are also extracted and saved in the same dictionary. Finally, for the API calls, the dx method (from the Analyze package) is used to create a map between the permission requests and the API calls.
– Preprocessing – Vector-Matrix Creation: To create a vector matrix for permissions, intents, and API calls, we used another Python script which iteratively extracts all the unique permissions (identified by being comma-separated and enclosed by []). Within this loop, the comma, ], and [ characters are also removed, and the corresponding permission names are added to a list to keep track of them. Then, in a nested loop, the script picks a permission attribute and checks the entire permission set. If there is a match anywhere, it puts a 1 in the corresponding field (otherwise, a value of 0). In the final output CSV file, a row represents an individual sample, and the columns show all the extracted permissions from the AndroidManifest.XML file. For every sample, a series of 1s and 0s shows which attributes the sample owns (a minimal sketch of this step is also provided at the end of this list). At the end of this stage, 281 unique attributes for permissions, 114 unique attributes for API calls, and 528 unique attributes for intents were extracted in total from the CIC-AndMal2017 dataset. Also, the corresponding headers are extracted for feature selection (and manual analysis) purposes.
– Preprocessing – Network Mean Evaluation: This step utilized the network-traffic CSV files which are part of CIC-AndMal2017. The original dataset includes 101 CSV files for ransomware and 325 CSV files for nonransomware (for a total of 10-854 different samples). A Python script (i.e., network mean calculation.py) is executed to create a consolidated view of the samples (see the corresponding sketch at the end of this list). This script reads every single CSV file, calculates the mean of every feature, and creates a single record for that sample (in a consolidated file). The output CSV file (from this task) uses the folder name as the class name (i.e., Nonransomware or Ransomware), and captures the sample name under the name column. For this work, we chose not to use a class-balancing technique in order to preserve the real-world distribution of this dataset (one of the reasons for selecting it in the first place).
– Main Analysis: At this stage, the dataset is ready for analysis, as all the required features are available in 4 separate files: all data extracted previously is consolidated into a data folder, while the corresponding header files are placed into a folder of the same name. All configurations previously discussed (e.g., preprocessing type, usage of PCA) are controlled via variables (as depicted in Fig. 4), so that all planned experimental configurations can be executed by simply changing the values of those variables. As previously mentioned, Gradient Boosting was a special case which required its own configuration variable: if the Feature_Selection_Module_for_GB flag is set to true, then Gradient Boosting builds its model based on the output of the feature selection module in sklearn; otherwise, the entire dataset is passed to it.
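The following minimal sketch illustrates the feature-selection step described earlier in this list using sklearn's SelectFromModel; the estimator, data, and feature names are illustrative placeholders.

```python
# Minimal sketch: model-based feature selection with SelectFromModel.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X = np.random.rand(300, 80)                     # placeholder feature matrix
y = np.random.randint(0, 2, 300)
feature_names = np.array([f"feature_{i}" for i in range(X.shape[1])])

# Fit an estimator exposing feature_importances_; keep features above threshold.
selector = SelectFromModel(RandomForestClassifier(random_state=42)).fit(X, y)
X_selected = selector.transform(X)
selected_names = feature_names[selector.get_support()]
print(len(selected_names), "features kept")
```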
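Similarly, the vector-matrix creation step can be sketched as the one-hot encoding below; the sample and permission names are hypothetical.

```python
# Minimal sketch: building a one-hot vector matrix from per-sample permissions.
import pandas as pd

per_sample = {  # hypothetical parsed output of the static-extraction step
    "sample_1.apk": ["android.permission.INTERNET", "android.permission.WAKE_LOCK"],
    "sample_2.apk": ["android.permission.INTERNET"],
}
unique_attrs = sorted({p for perms in per_sample.values() for p in perms})

# One row per sample, one column per unique attribute, 1 if the sample owns it.
matrix = pd.DataFrame(
    [[1 if attr in perms else 0 for attr in unique_attrs]
     for perms in per_sample.values()],
    index=list(per_sample.keys()), columns=unique_attrs)
matrix.to_csv("permission_vector_matrix.csv")
```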
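Finally, a minimal sketch of the network-mean consolidation step, assuming one CSV file of network flows per sample under hypothetical Ransomware and Nonransomware folders:

```python
# Minimal sketch: one mean per numeric flow feature, one output row per sample.
import glob
import os

import pandas as pd

rows = []
for label in ("Ransomware", "Nonransomware"):              # folder name = class label
    for path in glob.glob(os.path.join(label, "*.csv")):
        flows = pd.read_csv(path)
        means = flows.mean(numeric_only=True)               # mean of every numeric column
        means["name"], means["class"] = os.path.basename(path), label
        rows.append(means)

pd.DataFrame(rows).to_csv("network_means.csv", index=False)
```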
Fig. 4. [Images not available. See PDF.]
Main analysis settings.
Fig. 5. [Images not available. See PDF.]
Recall scores without cross-validation. (a): Average recall score per group. (b): Recall breakdown for group A. (c): Recall breakdown for group B.
EVALUATION
As previously mentioned, this research aims to find out which combination of dynamic (i.e., network traffic) and static (i.e., permission, intent, and API call) features yields better results for detecting ransomware. In addition, this research aims to identify a set of important features for ransomware detection. The following paragraphs discuss the results of the experiments conducted to evaluate the proposed model and the performance of the chosen classifiers across the different experimental configurations. The first part presents the results obtained without k-fold cross-validation, as these results can offer a performance baseline, while the second part of the section presents the results obtained after applying 10-fold cross-validation. Then, the section concludes with a discussion of the relevant features identified, as well as the main insights derived from our evaluation.
Evaluation without k-Fold Cross-Validation
As previously discussed in Section 4.2, our experiments involved 28 different experimental configurations. They are listed in Table 1 and have been categorized in groups as follows: Group A involves all experimental configurations where the scaling method is Scale and the PCA flag is set to false, while group B involves those configurations using the MinMax scaling method and the PCA flag set to false. Similarly, groups C and D involve the configurations which use the PCA flag set to true and the Scale and MinMax scaling methods, respectively.
Table 1. Experimental configurations evaluated

| Group | Exp. # | Concatenation type | Scaling method | PCA flag |
|---|---|---|---|---|
| A | 1 | Network, Permission, Intent, API | Scale | False |
|   | 2 | Network, Permission, Intent | Scale | False |
|   | 3 | Network, Permission, API | Scale | False |
|   | 4 | Network, Intent, API | Scale | False |
|   | 5 | Network, Permission | Scale | False |
|   | 6 | Network, Intent | Scale | False |
|   | 7 | Network, API | Scale | False |
| B | 8 | Network, Permission, Intent, API | MinMax | False |
|   | 9 | Network, Permission, Intent | MinMax | False |
|   | 10 | Network, Permission, API | MinMax | False |
|   | 11 | Network, Intent, API | MinMax | False |
|   | 12 | Network, Permission | MinMax | False |
|   | 13 | Network, Intent | MinMax | False |
|   | 14 | Network, API | MinMax | False |
| C | 15 | Network, Permission, Intent, API | Scale | True |
|   | 16 | Network, Permission, Intent | Scale | True |
|   | 17 | Network, Permission, API | Scale | True |
|   | 18 | Network, Intent, API | Scale | True |
|   | 19 | Network, Permission | Scale | True |
|   | 20 | Network, Intent | Scale | True |
|   | 21 | Network, API | Scale | True |
| D | 22 | Network, Permission, Intent, API | MinMax | True |
|   | 23 | Network, Permission, Intent | MinMax | True |
|   | 24 | Network, Permission, API | MinMax | True |
|   | 25 | Network, Intent, API | MinMax | True |
|   | 26 | Network, Permission | MinMax | True |
|   | 27 | Network, Intent | MinMax | True |
|   | 28 | Network, API | MinMax | True |
To analyze all the 28 different outputs, we considered the average recall score for each group. In other words, this average score indicates which experimental setup performs better than the others. At this stage, we did not consider the effect of individual concatenations (i.e., experimental configurations) yet. These results are shown in Fig. 5a. It can be noticed in this figure that groups A and B yielded the best overall results. This means that enabling PCA had a negative impact on the recall score, while using the Scale and/or MinMax methods benefitted it. This also suggests that CIC-AndMal2017 suffers from outliers and, consequently, PCA does not work properly for this dataset. As the next step in our analysis, we analyzed the experimental configurations within groups A and B to find out which classifier and which concatenation showed the best results. This analysis identified that experimental configurations #5 (i.e., concatenation of network and permission features with the Scale method and PCA disabled) and #14 (i.e., concatenation of network and API features with the MinMax method and PCA disabled) achieved the highest recall score with a value of 99.2%.
To have a more complete baseline (against which to compare the cross-validation experiments) and an overall better understanding of the performance of the different classifiers within the experimental configurations, we also reviewed their achieved accuracy score without cross-validation. This information is shown in Fig. 6a, which presents the average accuracy score for all groups. There, it can be noticed how accuracy followed a similar trend with respect to recall: That is, groups A and B also achieved the best average accuracy score. Thus, a similar breakdown analysis was performed for groups A and B (depicted in Figs. 6b and 6c, respectively). Such exercise reported that the same experimental configurations (i.e., #5 and #14) were the best performers among the universe of evaluated ones: These experimental configurations achieved the best accuracy with a score of 98.7%. However, it is worth mentioning that random forest (part of the experimental configuration #4) also achieved a 98.7% accuracy. After further analysis, it became clear that random forest obtained a better precision score, resulting in a better accuracy score, whereas gradient boosting showed a better recall score. Also, overall, the scaling method had a positive impact on this score, whereas PCA tended to reduce it. Finally, the best combinations of static and dynamic features at this stage were the network and permission features and the network and API features, based on this dataset.
Fig. 6. [Images not available. See PDF.]
Accuracy scores without cross-validation. (a): Average accuracy score per group. (b): Accuracy breakdown for group A. (c): Accuracy breakdown for group B.
Evaluation with k-Fold Cross-Validation
After adding cross-validation to our model evaluation, we repeated the 28 experimental configurations (i.e., those shown in Table 1). The obtained recall results are shown in Fig. 7a. As the figure shows, adding cross-validation to the main model still follows the previously observed pattern of results. This means that groups A and B again achieved the best average scores. In other words, using the Scale or MinMax method without PCA continues to be the best setup for this experiment. Next, we compared the results of the experimental configurations 1 to 14 (i.e., those belonging to groups A and B). This information is shown in Figs. 7b and 7c. There, it can be noticed how applying cross-validation reduced the overall recall scores achieved. Moreover, the best score occurred for the concatenation of network, permission, and API call features, a combination which achieved a value of 93.33%.
Fig. 7. [Images not available. See PDF.]
Recall scores with cross-validation. (a): Average recall score per group. (b): Recall breakdown for group A. (c): Recall breakdown for group B.
Similarly to the previous experiment, we also analyzed the accuracy score achieved by the different experimental configurations in order to complement our analysis. The obtained results (depicted in Figs. 8a–8c) showed that, similarly to the recall score, in general, the accuracy scores with cross-validation followed the pattern previously observed for the experimental configurations without cross-validation. That is, the best accuracy scores occurred when Scale or MinMax is enabled and PCA is disabled.
Fig. 8. [Images not available. See PDF.]
Accuracy scores with cross-validation. (a): Average accuracy score per group. (b): Accuracy breakdown for group A. (c): Accuracy breakdown for group B.
It is worth mentioning that, overall (and somewhat expectedly), by applying sklearn’s cross_validate function, which implements k-fold cross-validation, both recall and accuracy scores were reduced (by about 6%). However, achieving good results (comparable to those obtained before cross-validation) was still possible when the network and permission, network and API, or network, permission, and API features were combined.
Evaluation with k-Fold Cross-Validation and Parameter Tuning
Until now, the tested experimental configurations of the two previous experiments used the default values of the respective machine learning algorithms’ hyperparameters. Thus, as a final step, we conducted a third experiment which utilized sklearn’s RandomizedSearchCV function to find the best possible values for the hyperparameters and avoid any potential bias in their configuration. The obtained results are discussed next.
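As a minimal sketch of this tuning step (the parameter grid and data below are illustrative placeholders, not the exact grid used in our experiments):

```python
# Minimal sketch: randomized hyperparameter search optimizing recall.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X = np.random.rand(300, 40)          # placeholder features
y = np.random.randint(0, 2, 300)

param_distributions = {"n_estimators": [100, 200, 300],
                       "learning_rate": [0.01, 0.05, 0.1],
                       "max_depth": [2, 3, 5]}
search = RandomizedSearchCV(GradientBoostingClassifier(random_state=42),
                            param_distributions, n_iter=10, cv=10,
                            scoring="recall", random_state=42)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```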
The execution was similar to the previous experiments: Other than changing our code slightly to apply the RandomizedSearchCV function, we re-executed all 28 possible experimental configurations (i.e., those presented in Table 1). Our analysis also followed the same approach previously discussed: That is, we started clustering the results into groups in order to compare the average recall score achieved by the groups, followed by conducting a breakdown within the groups that obtained the highest average recall to identify those experimental configurations which actually achieved the best score. The average recall scores are shown in Fig. 9a. There, it can be noticed that groups A and B outperformed the other two groups again.
Fig. 9. [Images not available. See PDF.]
Recall scores with cross-validation and parameter tuning. (a): Average recall score per group. (b): Recall breakdown for group A. (c): Recall breakdown for group B.
Next, we analyzed the individual experimental configurations within these two groups (results depicted in Figs. 9b and 9c). As these figures show, the experimental configurations #5 (i.e., concatenation of network and permission features) and #14 (i.e., concatenation of network and API features) obtained the best recall scores (i.e., 93.67% for #5 and 92.96% for #14). To complement our analysis, we also reviewed the performance of the experimental configurations with respect to the accuracy score (as it is a complementary metric for this work). In general, the results (shown in Figs. 10a, 10b, and 10c for the average accuracy, group A, and group B, respectively) reinforced our previous observations and findings, as they were quite similar to the recall scores and followed the same trends.
Fig. 10. [Images not available. See PDF.]
Accuracy scores with cross-validation and parameter tuning. (a): Average accuracy score per group. (b): Accuracy breakdown for group A. (c): Accuracy breakdown for group B.
Relevant Dynamic and Static Features for Ransomware Detection
Regarding the second aim of this work, the set of important static and dynamic features identified in our experiments is discussed next:
As the gradient boosting classifier (used in the experimental configurations #5 and #14) achieved the overall best results, we concentrated our analysis on those two configurations to identify which features (among the universe of possible features) were the relevant ones used there. The results of our analysis are presented in Table 2, which lists all the important (i.e., used by this classifier) API, network, and permission features. To further validate our insights, this information was compared against the features suggested in other previous works in the area [9, 15, 16, 18, 21]. This exercise reinforced our observation about the relevance of these features. For instance, permission features such as Wake Lock, Write Settings, Read Settings, Write External Storage, Send Message, and Package Install are mentioned in [15, 18, 21]. Regarding network features, the results presented in [9, 16] offer similar insights to ours: Except for the first 7 network features (presented in Table 2), the other features are also identified as relevant in those works. Regarding API features, the closest work to ours is [15]. Thus, we compared our results to theirs. In this case, there was no overlap among the results. This is likely caused by the fact that that previous work focuses on malware detection in general, instead of ransomware detection (as ours does).
Table 2. Important features for ransomware detection

| # | API feature | Network feature | Permission feature |
|---|---|---|---|
| 1 | reset | Flow Duration | WAKE_LOCK |
| 2 | getAllCellInfo | Total Fwd Packets | ROOT_RECOVERY_STATE |
| 3 | discoverServices | Flow IAT Std | RAISED_THREAD_PRIORITY |
| 4 | disable | Flow IAT Max | WRITE_SETTINGS |
| 5 | addNmeaListener | Fwd IAT Total | READ_CALENDAR |
| 6 | getName | Fwd PSH Flags | MIPUSH_RECEIVE |
| 7 |  | Fwd URG Flags | RECEIVE_BOOT_COMPLETED |
| 8 |  | Bwd URG Flags | BIND_DEVICE_ADMIN |
| 9 |  | Fwd Header Length | INTERNAL_SYSTEM_WINDOW |
| 10 |  | Bwd Header Length | READ_SETTINGS |
| 11 |  | Packet Length Std | WRITE_HISTORY_BOOKMARKS |
| 12 |  | Bwd Packet Length Std | SEND_MESSAGE |
| 13 |  | Subflow Bwd Packets | ACCESS_MTK_MMHW |
| 14 |  | Subflow Fwd Bytes | FB_APP_COMMUNICATION |
| 15 |  | Bwd Packet Length Min | ACCESS_WAKE_LOCK |
| 16 |  | Flow Packets/s | WRITE_INTERNAL_STORAGE |
| 17 |  | Flow IAT Mean | BROADCAST_PACKAGE_INSTALL |
| 18 |  | Max Packet Length | SEND_DOWNLOAD_COMPLETED_INTENTS |
| 19 |  | PSH Flag Count |  |
| 20 |  | Bwd Avg Bulk Rate |  |
| 21 |  | Flow IAT Min |  |
| 22 |  | Fwd IAT Mean |  |
| 23 |  | Bwd PSH Flags |  |
| 24 |  | Fwd Packets/s |  |
| 25 |  | Packet Length Mean |  |
| 26 |  | Packet Length Variance |  |
| 27 |  | ACK Flag Count |  |
In summary, based on our results and analysis of the selected features, it seems that a concatenation of the identified subsets of network and permission features (or network and API features) presented in Table 2 is the most useful one and can be leveraged to achieve very high recall and accuracy scores when aiming to detect ransomware in Android (based on the CIC-AndMal2017 dataset).
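As a complement to Table 2, the minimal sketch below shows one way such a ranking can be read off a fitted gradient boosting model through its feature_importances_ attribute; the data and feature names here are synthetic placeholders.

```python
# Minimal sketch: ranking features by importance from a fitted gradient boosting model.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

X = np.random.rand(300, 50)                            # placeholder features
y = np.random.randint(0, 2, 300)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

clf = GradientBoostingClassifier(random_state=42).fit(X, y)
ranking = pd.Series(clf.feature_importances_, index=feature_names)
print(ranking.sort_values(ascending=False).head(20))   # most influential features
```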
CONCLUSIONS
Nowadays, ransomware is one of the major cybersecurity threats and the Android ecosystem is no exception. The main goal of this research was two-fold: On one hand, to develop a machine learning model that improves the detection of ransomware in Android (with a special emphasis on avoiding false negatives, due to their potential negative impacts). On the other hand, to find a combination of hybrid (i.e., static and dynamic) features which may have the biggest effect on ransomware detection. More specifically, this work utilizes permissions, intents, and API calls (as static features), as well as network traffic flow (as dynamic features). Furthermore, we selected CIC-AndMal2017 as our training/validation dataset because it was the best fit for our research problem.
Our proposed model consists of a preprocessing phase and an analysis one. Preprocessing includes extracting static features with the Androguard library in Python, creating the vector matrices, and calculating the mean value of the network features. Then, the model works on different concatenations of inputs by applying different data processing methods. In terms of data split, 80% of the dataset is used for training and 20% for testing. Then, it uses the sklearn feature selection module and passes the dataset to 5 different machine learning classifiers. Overall, 28 different sets of results are generated (based on the different combinations of experimental setups).
Among all experimental results, the combination of network and permission features (with the Scale method and the PCA flag disabled) and the combination of network and API features (with the MinMax method and the PCA flag disabled) showed the best overall performance, reflected by a recall score of 99.2% achieved by the gradient boosting algorithm. Among the 5 classifiers evaluated, Gradient Boosting obtained the best recall score, and the best concatenations, i.e., either the network and permission features or the network and API features, reported 97% after applying cross-validation. Compared to the literature discussed in the related work section, our model also outperformed previous approaches, as the best score reported by [9] was 49.5% (with random forest), while the authors of [10] reported 83.5% (also with random forest), and [16] reported 97% (with long short-term memory).
Additionally, our experiments allowed us to identify and collect a list of the most important attributes among the universe of evaluated API, permission, and network features. To better contextualize this work and its results, it is worth mentioning that a relevant limitation of it lies in its exclusive reliance on network features as dynamic features, neglecting considerations of resource consumption. This design decision was taken in order to limit the number of features to be explored, as well as to offer a baseline of the performance of the model leveraging exclusively network features. We plan to address this limitation as part of our future work, where we will gradually explore what other types of dynamic features might be relevant for our research problem. Additionally, the utilization of sklearn (i.e., the Python machine learning library used) naturally introduced certain constraints related to its functionality, such as the inability to integrate the bagging algorithm with its feature selection module. Furthermore, the relatively modest size of the dataset poses challenges, with the original 426 malware samples dwindling to 299 for nonransomware and 73 for ransomware during the feature extraction process of the main model, resulting in a total of 372 samples. This has been compensated for by significantly diversifying the experimental configurations evaluated in this work, as illustrated by the valuable insights offered with respect to the evaluated models and algorithms. Finally, our results suggest a list of permissions (18 features), network flow (27 features), and APIs (6 features) which are important for detecting ransomware in the Android ecosystem; while the original set of extracted features within the model comprised 281, 114, 528, and 77 unique features for permissions, API calls, intents, and network, respectively.
As future work, we plan to further evaluate our model with other datasets, so that we can obtain more generalizable results and/or a better understanding of the scenarios where our model works appropriately (and when it does not). We will also explore adding on more dynamic features such as system logs, and device status (e.g., CPU usage, memory utilization) to evaluate their usefulness for the detection of Android ransomware. Finally, we plan to make our model publicly available as either a web application or a callable microservice API.
FUNDING
This work was supported by ongoing institutional funding. No additional grants to carry out or direct this particular research were obtained.
CONFLICT OF INTEREST
The authors of this work declare that they have no conflicts of interest.
Publisher’s Note.
Pleiades Publishing remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
AI tools may have been used in the translation or editing of this article.
REFERENCES
1 Teodorescu, C.A., Ciucu Durnoi, A.-N., and Vargas, V.M., The rise of the mobile Internet: Tracing the evolution of portable devices, Proc. Int. Conf. on Business Excellence, Bucharest, 2023, vol. 17, pp. 1645–1654.
2 Vijay, A., Portillo-Dominguez, A.O., and Ayala-Rivera, V., Android based smartphone malware exploit prevention using a machine learning-based runtime detection system, Proc. 10th IEEE Int. Conf. in Software Engineering Research and Innovation (CONISOFT), San José Chiapa, 2022, pp. 131–139.
3 Selvaganapathy, S.G.; Sadasivam, G.S.; Ravi, V. A review on Android malware: Attacks, countermeasures and challenges ahead. J. Cyber Secur. Mobility; 2021; 10, pp. 177-230.
4 Kivva, A., IT threat evolution in Q2 2023. Mobile statistics. https://securelist.com/it-threat-evolution-q2-2023-mobile-statistics/110427/. Accessed August 30, 2024.
5 Vulfin, A.M. Detection of network attacks in a heterogeneous industrial network based on machine learning. Program. Comput. Software; 2023; 49, pp. 333-345. [DOI: https://dx.doi.org/10.1134/S0361768823040126]
6 Kozachok, A.V. Formal logical language to set requirements for secure code execution. Program. Comput. Software; 2017; 43, pp. 314-319.
7 Alraizza, A.; Algarni, A. Ransomware detection using machine learning: A survey. Big Data Cognitive Comput.; 2023; 7, 143. [DOI: https://dx.doi.org/10.3390/bdcc7030143]
8 Chittooparambil, H.J., Shanmugam, B., Azam, S., Kannoorpatti, K., Jonkman, M., and Samy, G.N., A review of ransomware families and detection methods, in Proc. 3rd Int. Conf. on Reliable Information and Communication Technology (IRICT 2018) Recent Trends in Data Science and Soft Computing, Springer, 2019, pp. 588–597.
9 Lashkari, A.H., Kadir, A.F.A., Taheri, L., and Ghorbani, A.A., Toward developing a systematic approach to generate benchmark Android malware datasets and classification, Proc. IEEE Int. Carnahan Conf. on Security Technology (ICCST), Montreal, 2018, pp. 1–7.
10 Taheri, L., Kadir, A.F.A., and Lashkari, A.H., Extensible Android malware detection and family classification using network-flows and API-calls, Proc. IEEE Int. Carnahan Conf. on Security Technology (ICCST), Chennai, 2019, pp. 1–8.
11 Getman, A.I.; Ikonnikova, M.K. A survey of network traffic classification methods using machine learning. Program. Comput. Software; 2022; 48, pp. 413-423. [DOI: https://dx.doi.org/10.1134/S0361768822070052]
12 Holmes, G., Donkin, A., and Witten, I.H., Weka: A machine learning workbench, Proc. ANZIIS’94 – IEEE Australian New Zealand Intelligent Information Systems Conf., Brisbane, 1994, pp. 357–361.
13 Chebyshev, V., Mobile Malware Evolution 2019. https://securelist.com/mobilemalware-evolution-2019/96280. Accessed August 30, 2024.
14 Loper, E. and Bird, S., NLTK: The Natural Language Toolkit, 2002. arXiv:cs/0205028.
15 Surendran, R., Thomas, T., and Emmanuel, S., A tan based hybrid model for Android malware detection, J. Inf. Secur. Appl., 2020, vol. 54, p. 102483.
16 Bibi, I., Akhunzada, A., Malik, J., Ahmed, G., and Raza, M., An effective Android ransomware detection through multi-factor feature filtration and recurrent neural network, Proc. 2019 UK/China Emerging Technologies Conf. (UCET), Stratford-upon-Avon, 2019, pp. 1–4.
17 Shiri, F.M., Perumal, T., Mustapha, N., and Mohamed, R., A comprehensive overview and comparative analysis on deep learning models: CNN, RNN, LSTM, GRU, 2023. arXiv:2305.17473.
18 Alsoghyer, S. and Almomani, I., On the effectiveness of application permissions for Android ransomware detection, Proc. 6th IEEE Conf. on Data Science and Machine Learning Applications (CDMA), Riyadh, 2020, pp. 94–99.
19 Scalas, M.; Maiorca, D.; Mercaldo, F.; Visaggio, C.A.; Martinelli, F.; Giacinto, G. On the effectiveness of system API-related information for Android ransomware detection. Comput. Secur.; 2019; 86, pp. 168-182. [DOI: https://dx.doi.org/10.1016/j.cose.2019.06.004]
20 Abdullah, Z., Muhadi, F.W., Saudi, M.M., Hamid, I.R.A., and Foozy, C.F.M., Android ransomware detection based on dynamic obtained features, in Proc. 4th Int. Conf. on Soft Computing and Data Mining (SCDM 2020), Recent Advances on Soft Computing and Data Mining, Melaka, Malaysia, Jan. 22–23,2020, Springer, 2020, pp. 121–129.
21 Pehlivan, U., Baltaci, N., Acartürk, C., and Baykal, N., The analysis of feature selection methods and classification algorithms in permission based Android malware detection, Proc. IEEE Symp. on Computational Intelligence in Cybersecurity (CICS), Orlando, FL, 2014, pp. 1–8.
22 Kannengiesser, U.; Gero, J.S. Modelling the design of models: an example using crisp-dm. Proc. Design Soc.; 2023; 3, pp. 2705-2714. [DOI: https://dx.doi.org/10.1017/pds.2023.271]
23 Arp, D., Spreitzenbarth, M., Hubner, M., Gascon, H., and Rieck, K., Drebin: Effective and explainable detection of Android malware in your pocket, Proc. Network and Distributed System Security (NDSS) Symp., San Diego, 2014.
24 Allix, K., Bissyandé, T.F., Klein, J., and Le Traon, Y., Androzoo: Collecting millions of Android apps for the research community, Proc. 13th Int. Conf. on Mining Software Repositories, MSR’16, New York: Association for Computing Machinery, 2016, pp. 468–471.
25 Maiorca, D.; Ariu, D.; Corona, I.; Aresu, M.; Giacinto, G. Stealth attacks: An extended insight into the obfuscation effects on android malware. Comput. Secur.; 2015; 51, pp. 16-31. [DOI: https://dx.doi.org/10.1016/j.cose.2015.02.007]
26 VirusShare Dataset. https://github.com/seifreed/VirusShare. Accessed August 30, 2024.
27 Koo Ping Shung, Accuracy, precision, recall or F1, Towards Data Sci., 2018, vol. 15, no. 03.
28 sklearn – preprocessing. https://scikit-learn.org/stable/modules/preprocessing.html. Accessed August 30, 2024.
29 Abdi, H.; Williams, L.J. Principal component analysis. Wiley Interdiscip. Rev.: Comput. Stat.; 2010; 2, pp. 433-459. [DOI: https://dx.doi.org/10.1002/wics.101]
30 Zhou, J.; Gandomi, A.H.; Chen, F.; Holzinger, A. Evaluating the quality of machine learning explanations: A survey on methods and metrics. Electronics; 2021; 10, 593. [DOI: https://dx.doi.org/10.3390/electronics10050593]
31 Anguita, D., Ghelardoni, L., Ghio, A., Oneto, L., Ridella, S., et al., The ’K’ in k-fold cross validation, Proc. European Symp. on Artificial Neural Networks, Computational Intelligence and Machine Learning, Bruges, 2012, vol. 102, pp. 441–446.