Many binary code analysis tools rely on an intermediate representation (IR) derived from binary code, instead of working directly with machine instructions. In this paper, we first consider binary code analysis problems that benefit from an IR and compile a list of requirements that an IR suitable for solving these problems should meet. Generally speaking, a universal binary analysis platform requires two principal components. The first component is a retargetable instruction decoder that utilizes external specifications to describe target instruction sets. External specifications facilitate maintenance and allow one to quickly implement support for new instruction sets. We analyze some of the most popular instruction set architectures (ISAs), including those used in microcontrollers, and from that compile a list of requirements for the retargetable decoder. We then overview existing multi-ISA decoders and propose our vision of a more generic approach, based on a multi-layer directed acyclic graph that describes the decoding process in universal terms. The second component of the analysis platform is the actual architecture-neutral IR. In this paper, we survey such IRs and propose Pivot 2, an IR that is low-level enough to be easily constructed from decoded machine instructions while remaining easy to analyze. The main features of Pivot 2 are explicit side effects, SSA variables, a simpler alternative to phi-functions, and an extensible set of elementary operations at its core. The IR also supports machines with multiple memory address spaces. Finally, we propose a way to tie the decoder and the IR together so that they fit most binary code analysis tasks through abstract interpretation on top of the IR. The proposed scheme takes into account various aspects of target architectures that are overlooked in many other works, including pipeline specifics (handling of delay slots, hardware loop support, etc.), exception and interrupt management, and a generic address space model in which accesses may have arbitrary side effects due to memory-mapped devices or other non-trivial behavior of the memory system.
INTRODUCTION
Both software developers and cyber security experts face binary code analysis problems. The need for analysis at the level of binary code is not due solely to the unavailability or partial loss of source code. At the level of a high-level programming language, it is difficult or impossible to assess the damage that a software defect can cause. Optimizing code transformations carried out by modern compilers can introduce serious and potentially exploitable defects when they encounter aspects of program behavior that go beyond the specification of the programming language [1]. Developers have to work with executables when using dynamic analysis for bug detection, when debugging and testing, and in some code reuse problems. Security experts do so as well when searching for undeclared software features and assessing software security.
Presently, there are many tools that automate (to various degrees) binary code analysis for solving practical problems. Common to many of these tools is the use of an intermediate representation (IR), into which the code to be analyzed is translated much as in a compiler. Historically, this method of code processing was employed in binary translation, binary instrumentation [2], and analysis problems.
Popular processor architectures have quite extensive instruction sets. This is why it is reasonable to implement complex analysis at the level of an IR, which allows "large" instructions of the target architecture to be simplified by dividing them into smaller components. Translating various target architectures into a common IR also facilitates adapting an analysis tool to a new target architecture: it suffices to develop the corresponding translation module.
Along with these benefits, the practical application of IRs also faces certain difficulties.
The first works [3] borrowed IRs directly from compilers, e.g., LLVM [4]. Unfortunately, this approach proved ineffective because binary code does not include an essential component of LLVM, namely data types, and they cannot be recovered in the process of translation.
The works of the next "wave" were devoted to the development of specialized IRs that meet the requirements of binary code analysis: Vine [5], Pivot [6], BAP [7], and REIL [8]. These IRs were originally designed to be derivable from binary code for the purpose of analyzing its properties. The problem was that different software tools used IRs of different types and degrees of detail, and translation functions had to be implemented from scratch each time. Besides the obvious increase in the effort of developing new tools, when several tools are used, their results do not always match, because different tools may have different flaws in processing individual instructions and their characteristics.
Specialization allowed IRs to be applied quite successfully to general-purpose processor architectures (x86, ARM, and MIPS). However, these IRs failed to describe certain aspects of the behavior of these machines and of some other less customary (but no less popular) processor architectures. This was primarily due to the specifics of exception handling, instruction pipelining, and memory access. In addition, the security of Internet of Things (IoT) devices makes the analysis of microcontroller code increasingly important, while the corresponding capabilities of presently available software tools prove insufficient.
In this paper, we propose a new IR that, on the one hand, is specialized for solving binary code analysis problems and, on the other hand, is universal as it can vary the level of detail when describing the operation of a wide class of processor architectures (both general-purpose ones and those used in microcontrollers).
The paper is organized as follows. Section 2 considers the most common binary code analysis problems to compile a list of IR requirements. Section 3 discusses specific features of various processor architectures that also need to be taken into account when designing an IR. Section 4 describes some existing solutions for machine instruction decoding and presents a new decoding method, which is an integral part of the proposed approach. Section 5 describes our approach, as well as overviews some presently available approaches to modeling the operational semantics of binary codes. Section 6 discusses the general scheme of the proposed solution. Section 7 contains the conclusions and outlines the directions for further research.
TYPICAL SCENARIOS OF BINARY CODE ANALYSIS
To formulate the requirements for a universal IR for binary code analysis, we need to consider some common problems. Figure 1 classifies these problems at different levels based on their nature (target or auxiliary). Based on the type of operations performed on the IR, they can be conventionally grouped into three categories: abstract interpretation, concrete execution, and symbolic execution.
Fig. 1. Binary code analysis problems that benefit from using an intermediate representation.
Concrete Execution
Emulation based on dynamic binary translation (DBT) is a common use of an IR derived from binary code. The emulated instructions are decoded linearly up to a branch, the resulting code fragment (translation block) is translated into the IR, some optimizations are carried out, and the IR is translated into host machine code [9]. To maintain the overall performance of the emulator, a tradeoff must be achieved between optimization quality and the associated overhead. That is why the use of the IR is limited to "light" optimizations: dead code elimination, constant folding and propagation, live variable analysis, etc.
In the problems of debugging automation (memory access debugging, race detection, etc.), as well as in the process of profiling, translation is supplemented with instrumentation. The most famous systems are Valgrind [2], Pin [10], and DynamoRIO [11].
Generally speaking, there are two approaches to instrumentation [2]. The copy-and-annotate approach decodes program instructions at runtime, with some of them (depending on the problem to be solved) being replaced by calls to auxiliary functions that perform additional processing; in some cases, operands are replaced, etc. This approach is employed in Pin and DynamoRIO. The disassemble-and-resynthesize approach is similar to DBT with in-depth parsing of the operational semantics of each instruction: instrumentation is carried out on top of the IR, which is optimized and then used to generate host machine code. This approach, implemented in Valgrind, is significantly more flexible; however, it incurs higher overhead.
In these scenarios, the IR is required to be suitable for basic optimizations: if possible, it should have no side effects, comply with static single assignment (SSA) form, and provide a simple yet efficient API to instruction decoding results (opcode, operands, addressing mode, etc.).
Symbolic Execution
Symbolic execution is a more general model of program execution: it is still confined to one path at a time, but it replaces concrete values of variables with abstract (symbolic) ones. Generally, the formulas that determine relationships between these values are quantifier-free first-order predicates over arrays of bit vectors (the QF_ABV logic).
Symbolic execution is used for automatic detection of software defects and estimation of how critical they are. The architecture of the corresponding software tools (S2E [12] and Mayhem [13]; see also the works by A. Fedotov and V. Kaushan [14]) has become well-established: it is a chain of code analysis and transformation operations of several types (see Fig. 2).
Fig. 2. Generic architecture of a symbolic analysis tool.
Program execution with symbolic state support is enabled by binary instrumentation tools (Valgrind [2], Pin [10], and QEMU [9]). Dynamic taint analysis selects instructions that process user (symbolic) input. Then, the selected instructions are translated into the IR. For the IR, symbolic interpretation rules that yield expressions over symbolic variables and constants are specified. To describe the current execution path, all conditional branches traversed are translated into constraints over symbolic expressions, which allows one to obtain a path predicate. The path predicate, supplemented with a security predicate (a formal description for a certain type of errors), is sent to an SMT solver.
If the IR operations prove "similar" to the QF_ABV logic, the cost of this second translation is reduced, and the rules become trivial.
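As an illustration of path and security predicates, consider the following minimal sketch using the Z3 SMT solver's Python bindings (chosen here purely for concreteness; the paper does not prescribe a particular solver front end, and the expressions are invented for this example). Suppose symbolic interpretation produced y = x * 4 for a tainted 32-bit input x, and the traversed branch recorded the unsigned constraint y < 64; a security predicate asking whether y can equal 0x40 then conjoins with the path predicate into an unsatisfiable QF_ABV formula:

from z3 import BitVec, Solver, ULT

x = BitVec("x", 32)        # symbolic user input
y = x * 4                  # expression built by symbolic interpretation

s = Solver()
s.add(ULT(y, 64))          # path predicate recorded at the traversed branch
s.add(y == 0x40)           # security predicate for the error condition
print(s.check())           # prints "unsat": the error is unreachable on this path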
It should be noted that it is possible to construct a symbolic analysis architecture with a different sequence of operations, whereby the entire code is translated into an IR, and symbolic values are tracked directly on it, without the intermediate step of tainted data tracking.
Abstract Interpretation
A disassembler is the main tool for manual binary code analysis. It decodes instructions from a code segment in one of two ways: by sequential analysis from the beginning of the segment or by recursive descent from entry points [15]. The second method is more popular because it can be used in the case of gaps or data (e.g., jump tables) between subroutines. Speculative disassembly [16] is also employed. In this case, each offset in the code segment is regarded as a possible beginning of a function’s code; then, unsuitable candidates are screened out at the stage of conflict resolution.
Thus, the solution of the disassembly problem requires at least a machine instruction decoder with its output being represented in the assembler form. The approaches based on recursive descent additionally require classification of instructions based on the type of branching, as well as calculation of jump target addresses. One way to calculate possible execution addresses is to use the VSA algorithm [17], which can approximate addresses and integers. In addition to detection of code fragments (based on jump addresses), it enables improved detection of defects and vulnerabilities by using static analysis.
To find defects and vulnerabilities in a binary code, static analysis and symbolic execution are employed. For instance, BINSIDE [18] uses the intermediate representation REIL [8] for subsequent translation into LLVM bit code [4], on top of which certain types of defects are found with the help of checkers.
Essentially, both static analysis and symbolic execution are flavors of abstract interpretation [19]. The difference is that, in the former case, all paths in the program are analyzed simultaneously (as a rule, transfer functions are applied iteratively to find a solution corresponding to a fixed point, the so-called MFP (maximal fixed point) solution); in the latter case, a certain subset of paths is analyzed, with each path being processed independently. When all possible paths can be analyzed, the corresponding solution (the MOP, meet-over-all-paths, solution) is more accurate than the MFP solution. In practice, however, this is possible only for the most primitive programs.
Regardless of the way in which abstract interpretation is used, the analysis infrastructure requires instruction decoding and simplification to facilitate the description of the lattices and transfer functions that define this interpretation. That is why, to detect defects and vulnerabilities in a binary code, quite powerful IRs are most commonly employed. In addition to BINSIDE, similar problems are solved using BitBlaze [5] and BAP [7]; each of these systems has its own IR to describe semantics of machine instructions.
SPECIFIC FEATURES OF TARGET PROCESSOR ARCHITECTURES
Our experience suggests that, when conducting dynamic analysis of a binary code [20], all general-purpose processor architectures prove fairly similar in terms of instruction sets and their coding, as well as the operation of their pipelines and memory subsystems. However, microcontrollers and DSPs significantly differ from general-purpose processors: they belong to non-von Neumann architectures, provide discontinuous instruction placement in memory, lack pipeline stall in the case of data or control hazards, etc. This section discusses the most important features of some target processor architectures that should be taken into account when developing IRs for binary code analysis.
ARM Family
The ARM family of processor architectures is currently represented by ARMv7 and ARMv8, which, in turn, are divided into profiles A, R, and M. Depending on its model, an ARM processor can support up to three instruction sets: A32 and A64 sets with fixed instruction lengths and T32 set with variable instruction length. At any instant, the currently used instruction set is determined by the machine’s control registers and can be changed at runtime by using special instructions.
Among the specific features of ARM, we can point to its implementation of conditional execution. In the A32 set, almost all instructions have a cond field that specifies the conditions for instruction execution. In the A64 set, only a small number of instructions have this field. In the T32 set, in addition to several instructions that have an explicit cond field, there is the IT instruction, which, depending on the computed value of the condition, determines which of the instructions in the following conditional block (of up to four instructions) are executed when the condition is true and which are executed when it is false.
Another feature is co-processors, which operate using MCR and MRC instructions. Specialized solutions based on the ARM architecture, in addition to standard co-processors, can have various extensions.
AVR Family
The AVR family of microcontrollers has a very small instruction set. Instructions can be 16 or 32 bits long; the length of the instruction can be uniquely determined by a bit prefix. AVR microcontrollers have separate code memory and data memory spaces, with a part of the latter being used to access peripherals. Thus, the main characteristic of these microcontrollers is that they belong to non-von Neumann architectures.
MIPS Family
The MIPS family of processor architectures is a typical representative of RISC: instructions have a simple structure, fixed length, and fairly simple semantics, so they can be easily decoded and their behavior can be easily analyzed. However, since there are many versions of MIPS, along with various extensions, some instructions cannot be uniquely decoded without exact knowledge of the release number and the set of implemented extensions. The release number of the architecture can be read from system registers (note that some emulators implement this function inadequately); however, the set of extensions supported by a particular machine must be obtained from outside.
The MIPS pipeline supports delay slots, which should be taken into account when processing jump instructions.
PowerPC Family
The PowerPC family of processor architectures is another typical representative of RISC: all instructions have fixed length with a small number of coding formats (i.e., field arrangements). The semantics of instructions is simple, with the only distinctive feature being an explicit flag in arithmetic instructions that enables or disables updating the status word in the XER register.
RISC-V Family
A new RISC-V family of processor architectures is designed to facilitate, as much as possible, the hardware implementation of processors and of tools that work with RISC-V binary code. Instructions in the base sets have the same length and are explicitly divided into instruction subsets (an implementation may support only some of them), with a bit mask determining to which subset a particular instruction belongs. Most subsets use only four instruction encoding formats, which facilitates decoding. However, there is one exception: the C extension of the instruction set supplements the "standard" 32-bit instruction encoding with a 16-bit encoding, which allows a more concise encoding of some instructions from the basic set. The length of an instruction is always determined by the two lowest bits of its encoding, as the sketch below illustrates.
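A minimal sketch of this length rule (ours, in Python): the two lowest bits of the first 16-bit parcel equal 11 only for standard 32-bit encodings, so a decoder can size an instruction before decoding it.

def insn_length(first_parcel: int) -> int:
    """Instruction length in bytes for the base ISA plus the C extension."""
    return 4 if first_parcel & 0b11 == 0b11 else 2

assert insn_length(0xA403) == 4   # first parcel of a standard 32-bit load encoding
assert insn_length(0x4501) == 2   # a compressed 16-bit parcel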
Texas Instruments Family
The Texas Instruments (TMS) family of processor architectures is represented by a large number of microcontrollers. Microcontrollers with explicit instruction-level parallelism, e.g., TMS320C67x, have the most interesting features. In their case, the unit of fetching is a packet of eight instructions rather than a single instruction. The encoding of the packet indicates which of its instructions can be executed in parallel. It is also important that the processor supports jumps into the middle of a packet, which corresponds to its partial execution.
All TMS architectures have delay slots; hence, jump instructions require special processing.
X86 Family
The x86 (x86-64) family of processor architectures is currently the most popular general-purpose solution; hence, its binary code most often becomes the object of analysis. The x86 family belongs to the CISC class and supports a large number of instructions, multiple addressing modes, several encoding variants for the same instructions (e.g., with or without the VEX prefix), variable instruction length (which cannot be determined from any fixed-size prefix), etc. Many instructions (especially system ones) are fairly complex and need to be simplified during analysis by dividing them into smaller components. Thus, full-fledged support of the x86 architecture is not an easy task in terms of implementing the instruction decoder and describing the operational semantics of its instructions.
On the other hand, from a developer’s perspective, the x86 processor pipeline is quite simple and does not have any major properties that affect code analysis. This also applies to the memory system: its complexity consists only in the number of successive address translations with each individual stage being quite straightforward. A certain exception is new instruction sets associated with transactional synchronization extensions (TSX) and software guard extensions (SGX). However, depending on a particular problem, their processing can often be significantly simplified without loss of accuracy.
Xtensa Family
The Xtensa family of microcontrollers is widely employed for implementation of IoT devices because Xtensa-based solutions support 802.11 networks (WiFi) and Bluetooth out of the box. The basic set of instructions is rather small (less than 100 simple instructions); instructions have a fixed length of 24 bits and simple encoding. Like the C extension in RISC-V, some Xtensa microcontrollers support a short 16-bit notation for some instructions. Some models support hardware loops: the LBEG register determines the address of the beginning of the loop, the LEND register determines the end address, while the LCOUNT register contains the number of remaining iterations. During instruction fetching, the processor automatically decrements LCOUNT upon reaching the LEND address and, if necessary, jumps to the LBEG address. This behavior must be taken into account when analyzing the control flow of Xtensa code.
Xtensa processors can also operate in conjunction with external co-processors, which is why the full-fledged analysis of binary code for this architecture requires knowing which instructions are added by co-processors to the basic set.
Requirements for Intermediate Representation
In summary, we formulate a set of requirements that the IR for binary code analysis must meet in order to be suitable for the full-fledged analysis of all processor architectures mentioned above.
1. It must support non-von Neumann architectures with multiple memory address spaces.
2. Some architectures support hardware loops and other mechanisms that affect instruction fetching. Therefore, these mechanisms must be taken into account when interpreting and analyzing the control flow.
3. Many RISC architectures have delay slots; their number, as well as the behavior of jump instructions in terms of executing or discarding instructions in delay slots, may differ for different architectures. This information also affects both interpretation and analysis of the control flow.
4. It must support variable-length instruction encodings.
5. In many processor architectures, instruction decoding depends on values of control registers (e.g., the current instruction set in ARM, default operand size in x86, etc.). The decoder must have access to the context from which this information can be obtained.
6. Some processor architectures have different versions and sets of extensions. This information must also be used for decoding; otherwise, unambiguous decoding of instructions is impossible.
7. It must support processor architectures with explicit parallelism at the instruction level.
DECODING MACHINE INSTRUCTIONS
Any IR for binary code analysis is constructed in two steps: the first is decoding the instructions of a certain program or system, and the second is translating them into some unified form to be analyzed. In this section, we discuss the first step, taking into account the requirements formulated above.
Overview of Existing Solutions
Presently available binary decoding tools can be classified into two main groups. The first group includes decoders that support a particular architecture (or a family of architectures). These tools usually have a fairly detailed representation of instructions that, in addition to the text of the instruction, describes its structure in terms of mnemonics, prefixes, flags, operand addressing mode, etc. At the same time, since decoding libraries for different target architectures have different APIs, a tool that must support several architectures has to unify these APIs itself, which can reduce the benefits of using ready-made libraries.
The second group includes tools that support a number of processor architectures and yield universal structures that do not depend on a particular architecture. Currently, the most popular solutions are binutils libraries [21], Capstone library [22], and IDA Pro interactive disassembler [23]. Let us consider them in more detail.
The decoding library from binutils is used mostly to obtain the textual representation of a decoded instruction. The decoding logic is described by C functions without external specifications; it is most often implemented as a cascade of switch statements, which complicates debugging and modification of the decoder. The main advantage of binutils is the large number of supported architectures. However, the quality of this support varies: decoders for popular general-purpose architectures are fine-tuned, whereas the quality of instruction decoding for microcontrollers is significantly lower.
The Capstone library allows one to obtain both the textual representation of an instruction and a structure that uniformly (i.e., independently of the target architecture) describes its basic properties. These properties include the instruction's membership in a certain group (e.g., the group of jump instructions) and the lists of registers it reads and writes. A number of architectures are supported, including ARM, M68K, MIPS, PowerPC, TMS320C64x, and x86. At the same time, the library has some architectural limitations that prevent it from being used for all instruction decoding problems. The most significant limitations are listed below.
1. Detailed information about the instruction is available for each target architecture in the form of a data structure specific to this architecture, which significantly reduces the universality of the library in terms of interpreting the decoding results.
2. Even the structures with detailed information do not contain data on the number of delay slots and their processing method (these data are essential for all flow-sensitive types of binary code analysis).
3. Architectures with explicit parallelism at the instruction level are not supported.
4. It is impossible to obtain the list of memory areas accessed by the instruction, which restricts data flow analysis with this library only to registers.
Decoders in Capstone are implemented in C with the extensive use of macros, which complicates code maintenance.
Thus, the Capstone library is well suited for superficial analysis of a binary code (in particular, disassembly tasks and simple cases of control and data flow analysis). However, in its current version, it is not suitable for in-depth analysis.
The IDA Pro interactive disassembler is currently the tool of choice for analyzing binary code through the assembly listing recovered from it. The disassembler supports many architectures and provides an SDK for implementing this support in the form of plug-ins. However, the quality of decoding varies significantly from plug-in to plug-in. In particular, information about the registers read and written by an instruction is available only for some architectures; for the others, it is not implemented or works incorrectly. As for the SDK, due to its long history and the fact that IDA Pro's APIs were initially designed only for x86 code, these APIs have become rather complicated, with many implicit contracts.
Each decoding module is implemented in C++ or Python; like with the tools mentioned above, the decoding logic is described directly in the code. This tool does not provide any unified method for declarative description of instruction sets.
Thus, while being excellent solutions to particular problems, the decoding tools considered above are not universal. In the following subsection, we propose a decoding approach based on declarative description of instruction sets that is more flexible and, in our opinion, easier to maintain.
Proposed Decoding Approach
When developing this approach, in addition to the requirements that follow from the specific features of processor architectures, we also formulated a basic requirement for a unified decoder: it processes a declarative specification of a particular instruction set architecture (ISA) and provides the same API for all architectures.
Generally speaking, the specification of instruction encodings can be represented as a table that associates each instruction with a complete encoding (i.e., one in which all bits have concrete values) and a predicate over the current state of the machine. Figure 3 shows a fragment of this table for RISC-V processors. The predicate is needed to take into account instruction sets that are switched at runtime (e.g., in ARM processors), default operand sizes (e.g., in x86 processors), the set of architectural extensions present, and other similar characteristics that are not explicitly included in the instruction encoding. Obviously, such a table is large; thus, when designing the decoding logic, the main goal is to structure this table by grouping its rows. If the predicate is set aside, we have only a set of bit strings associated with different instructions. Data structures for fast searching in such arrays are well known; one of them is the compressed trie, which groups bit strings by common prefixes (see Fig. 4). In practice, however, it is often impossible to determine the mnemonic of an instruction from any contiguous prefix of its encoding. In particular, many RISC-V instructions have the opcode field divided into two parts: some bits are located at the beginning and some at the very end of the encoding. Thus, the trie's structure needs to be adapted to this problem.
Fig. 3. RISC-V decode table fragment.
Fig. 4. RISC-V compressed trie fragment.
It should be noted that the individual bit fields of an instruction encoding do not have to be examined in any particular order. If the encoding is divided into individual fields, then, generally speaking, the bits of these fields can be compared with those recorded in the table in arbitrary order. This approach can be formalized as an automaton with a corresponding tree, in which each edge is labeled with the bit field values whose match permits the transition to the corresponding vertex. Again, the order of the edges along a path from the root to a leaf does not matter; i.e., such trees can be transformed by rearranging their vertices without affecting the encodings they describe.
The structure of instruction encodings in processors is usually such that a certain general form of the instruction (opcode with modifiers, set of addressing modes, and operand sizes) is decoded first, and only then are the operands (particular registers, memory addresses, etc.) determined. This structure reflects how the initial stages of the pipeline are implemented in hardware. Based on these considerations, we divide the tree into two layers. The upper layer (nearer to the root) concentrates the checks of the bit fields that define the opcode and the addressing modes of operands. The lower layer (nearer to the leaves) concentrates the checks associated with the final determination of operands. In this two-layer tree, the lower layer contains a large number of duplicate patterns. By passing from the tree to an acyclic graph, these duplicates can be eliminated. However, in this case, only a part of the decoding result can be stored in the leaves; the other part must be accumulated while traversing the upper layer. This graph is illustrated in Fig. 5 (for simplicity, the edges between the layers are omitted). The left-hand side of the figure depicts the layer that decodes the opcode and addressing modes of operands; the right-hand side depicts the layer that decodes operands. The set of operands to be decoded is determined by the leaf vertex of the first layer.
Fig. 5. RISC-V decoder acyclic graph fragment.
Let us now return to the effect of machine state on instruction decoding. It should be noted that the corresponding predicate basically acts as a conjunct in the general form of the encoding, alongside the conjuncts responsible for checking the values of the bit fields. Hence, these predicates can be added to the graph under consideration as another layer nearest to the root. The rule for traversing the edges in this layer is as follows: the transition is allowed if the predicate indicated on the edge, which depends on the values of system register fields and the characteristics of the particular processor model, evaluates to true.
Once the graph described above is constructed for a certain processor, the result of decoding an instruction is represented as the set of data in the vertices corresponding to the predicate and the instruction's bit code or, in other words, as a path in this graph. The parts of the graph that describe operand decoding can be associated with the corresponding addressing modes. For each addressing mode, a set of operand characteristics that are not fixed in this mode is defined. In the case of RISC-V, this characteristic is the register number for the "register" addressing mode, the constant value for the "constant" mode, and the register number and offset value for the "indirect addressing" mode. Then, the part of the path that passes through the opcode layer defines the mnemonic and addressing modes of the operands, while the part passing through the operand layer completes the decoding of their characteristics. For example, the result of decoding the bit vector 000000000000_01001_010_01000_0000011 is a structure that contains the following information (a small code sketch of this two-layer decoding is given at the end of this section):
• instruction mnemonic (LW);
• the first operand has the “register” addressing mode and its “register number” characteristic has the value “x8;”
• the second operand has the “indirect addressing” mode;
• the “register number” characteristic of the second operand has the value “x9” and the offset is 0.
This representation of the decoding results makes it possible to
• obtain the textual representation of the instruction;
• given a path in the graph, restore the complete encoding;
• define, in a unified form, the formulas that select certain instructions based on their mnemonics and sets of operands, which is required when translating individual machine instructions into the IR that describes their operational semantics.
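To make the two-layer scheme concrete, the following minimal Python sketch (ours, not the actual decoder; the field boundaries are those of the RV32I I-type format) reproduces the worked example above for the LOAD instruction group: the first layer of checks fixes the mnemonic and addressing modes from the opcode and funct3 fields, and the second layer extracts the remaining operand characteristics.

def bits(word: int, lo: int, hi: int) -> int:
    """Extract the half-open bit range [lo, hi) from a 32-bit word."""
    return (word >> lo) & ((1 << (hi - lo)) - 1)

def decode_load(word: int) -> dict:
    # Layer 1: opcode and funct3 select the mnemonic and addressing modes.
    assert bits(word, 0, 7) == 0b0000011, "not a LOAD-class encoding"
    mnemonic = {0b000: "LB", 0b001: "LH", 0b010: "LW"}[bits(word, 12, 15)]
    # Layer 2: operand fields complete the decoding.
    return {
        "mnemonic": mnemonic,
        "rd":  {"mode": "register", "register": bits(word, 7, 12)},
        "mem": {"mode": "indirect", "register": bits(word, 15, 20),
                "offset": bits(word, 20, 32)},
    }

# 000000000000_01001_010_01000_0000011 decodes to LW x8, 0(x9).
print(decode_load(0b00000000000001001010010000000011))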
DESCRIBING THE OPERATIONAL SEMANTICS OF MACHINE INSTRUCTIONS
In the previous section, we considered decoding problems. Let us now discuss the part of the IR that is responsible for describing the operational semantics of machine instructions.
Overview of Existing Solutions
As mentioned in Introduction, the emergence of the first IRs that sufficiently accurately described the operational semantics of binary code was primarily due to the problems of dynamic binary translation and instrumentation. For example, the full-system open-source emulator QEMU [9] is based on the binary translation of a guest code into the code of the machine on which emulation is carried out. The translation uses the TCG IR. The guest code is translated into this representation; then, a machine code is generated based on it. A similar approach is implemented in the dynamic binary instrumentation system Valgrind [2], which uses the VEX IR. The TCG and VEX IRs are quite similar in their basic characteristics and have been successfully employed for many years without significant modifications.
However, these IRs are not suitable for in-depth analysis of data and control flows in binary code: both TCG and VEX support helper functions written in C, which can arbitrarily change the state of the system under analysis. Such helpers are quite common because they are used to emulate complex processor instructions.
Another problem of these IRs is deferred flag processing: while justified in dynamic analysis (it allows skipping the evaluation of a flag that will not be used later), it significantly complicates static analysis and symbolic execution because it breaks data dependence chains. Moreover, in both systems, the steps of IR construction and code generation are implemented as C modules, which complicates their modification and debugging.
Specialized IRs for binary code analysis, such as those used in Vine [5] and BAP [7], as well as the REIL IR [8], do not allow implicit actions, which makes them better suited to a variety of analysis scenarios, even though they are more verbose. However, like in QEMU and Valgrind, the translators are implemented as software modules and do not support declarative descriptions of instruction sets. In addition, these IRs fix the set of elementary operations into which the behavior of a machine instruction is divided, and this set is limited: in their current versions, these IRs are suitable mostly for integer arithmetic, bitwise logic, and jump instructions. They also lack memory and pipeline models and exception handling, which complicates their use for full-system analysis of binary code.
The Pivot IR [6] was originally designed for offline dynamic analysis of binary codes in x86 (x86-64), PowerPC, MIPS, and ARMv6. However, later, it was successfully used in some scenarios of static binary code analysis for these architectures. The key features of this IR are as follows:
• an arbitrary set of address spaces (including register spaces), which enables the support of non-von Neumann architectures and the separation of machine registers from temporary variables (virtual registers);
• temporary variables are in SSA form, which facilitates static analysis;
• the set of basic operations that express operational semantics is not fixed: depending on a particular task, the user can either define new operations or express the semantics of the instruction in terms of available ones;
• operations can have implicit side effects, which manifest themselves in reading or writing individual bits of the status word; these sets are specified for each operation, which is a certain tradeoff between the fully explicit description of all effects and the deferred evaluation of flags;
• a language is implemented to define external specifications of operational semantics; translators are generated automatically based on these specifications.
Even though this IR was successfully applied in a number of projects implemented by the Institute for System Programming of the Russian Academy of Sciences, with time, some of its principal limitations became apparent. First, in the process of its development, the description of the machine’s behavior “between instructions” (i.e., when fetching the next instruction, handling interrupts, etc.) was not taken into account.
Second, the simple memory model of various address spaces (byte arrays) used in Pivot is insufficient for full-system analysis: it is also necessary to take into account address translation (including multi-level one), memory-mapped devices, and direct memory access (DMA).
Third, the set of libraries for this IR includes only the translation and concrete interpretation means, while the entire logic of subsequent analysis (e.g., the iterative algorithm for finding the MFP solution to the data flow problem or the logic of symbolic state propagation) has to be implemented each time from scratch.
Finally, among academic work on describing the operational semantics of machine instructions, we can single out various dialects of the nML language [24], as well as the languages ISDL [25], L3 [26], and SAIL [27]. The most interesting approach is implemented in L3 and SAIL: a dialect of a purely functional language is employed to describe both the decoding and the behavior of instructions. Essentially, the entire operation of the machine is described by one pure function, which receives the current machine state and an instruction encoding as input and returns a new machine state. Thus, the machine's operation on a certain sequence of instructions can be represented as a composition of calls of this function. Compared to machine code, code in a purely functional paradigm is much easier to analyze, which is why the functional language itself is used as the IR.
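As an illustration of this purely functional style, here is a toy sketch in Python (the one-instruction ISA is invented purely for illustration; L3 and SAIL use their own functional dialects): the whole machine is a pure function step from a state and an encoding to a new state, and execution is a composition of its calls.

def step(state: dict, encoding: int) -> dict:
    """Pure transition function: returns a new state, never mutates the old one."""
    new = dict(state)
    opcode, reg = encoding >> 8, encoding & 0xFF
    if opcode == 0x01:                     # INC r: increment an 8-bit register
        new[reg] = (state.get(reg, 0) + 1) & 0xFF
    new["pc"] = state["pc"] + 2            # every instruction is 2 bytes here
    return new

s0 = {"pc": 0}
s2 = step(step(s0, 0x0105), 0x0105)        # composition of calls of step
assert s2 == {"pc": 4, 5: 2}               # register 5 incremented twice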
The disadvantages of these languages are their limited expressive power (in particular, there are difficulties with describing floating-point, vector, and system instructions) and the impossibility of specifying instruction fetching specifics (delay slots and hardware loops), as well as interrupt and exception handling behavior, i.e., actions associated with hardware context switching. Moreover, non-von Neumann architectures are not supported.
Proposed Intermediate Representation
This subsection describes the proposed IR—Pivot 2—which is an improvement of the original IR described in [6].
Address spaces
In Pivot 2, all data that processor instructions operate on are uniformly described as a set of address spaces. Each address space is isolated and has its own fixed address size. The register space is one of the address spaces: machine registers are assigned certain addresses that take their nesting into account, and subsequent operations on this space are carried out in the same manner as on a memory space. For non-von Neumann architectures, this model allows all memory spaces to be described without resorting to address translation or other similar techniques.
There are two types of address spaces: local and remote. Local address spaces are regarded as simple byte arrays local to the execution thread under analysis. In particular, this means that the value written at any address is returned unchanged by subsequent reads. In addition, access to a local address space always succeeds (it cannot throw exceptions). An example of a local address space is the set of general-purpose registers.
Remote address spaces, first, do not guarantee any particular access behavior (e.g., the I/O port space or a memory space with memory-mapped devices) and, second, allow an access to result in an error. For a remote address space, in addition to the address size, the size of the bit vector that describes the error that occurred is specified. This makes it possible to simulate the behavior of processor instructions that can throw exceptions.
When working with the IR, all address spaces are combined into an address space table that assigns an individual index to each space.
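The sketch below (a hypothetical rendering in Python dataclasses; the field names are ours) shows how the address space table described above could be laid out for an AVR-like machine, where registers, code, and data are separate spaces and data memory is remote because of memory-mapped peripherals.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class AddressSpace:
    name: str
    addr_bits: int                    # fixed address size of this space
    local: bool                       # local: plain byte array, accesses cannot fail
    error_bits: Optional[int] = None  # remote spaces only: error descriptor size

spaces = [
    AddressSpace("regs", addr_bits=8,  local=True),
    AddressSpace("code", addr_bits=16, local=True),
    AddressSpace("data", addr_bits=16, local=False, error_bits=8),
]
space_index = {s.name: i for i, s in enumerate(spaces)}  # the table assigns indices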
Operations
An operation is the minimal unit of operational semantics in Pivot 2. An operation receives a set of bit vectors as an input and yields another set of bit vectors as an output; i.e., it can have an arbitrary number of arguments and results. The set of operations is not fixed, which makes the IR relatively easy to extend. At the same time, there are two important requirements for operations:
• an operation does not have any inputs and outputs, except for its own formal arguments and results; i.e., it is a “pure” function;
• within one operation, data dependences form a complete bipartite graph: all its inputs affect all its outputs.
Thus, a certain function add that receives two 32-bit numbers and returns their sum along with an overflow attribute is an operation because all its arguments affect each of its results. A function xchg that receives two bit vectors of certain lengths and returns them in reverse order is not an operation because each of its outputs depends on only one of its inputs.
The properties of operations formulated above enable certain types of data flow analysis (e.g., slicing and taint analysis) even when semantics of each operation is unknown.
All operations used in the process of analysis are combined into an operation table that assigns an individual index to each operation.
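The contract on operations can be made concrete with a small sketch (names and signatures are ours, for illustration only): add below is a valid operation, since both outputs depend on both inputs, whereas xchg is not, since each output depends on only one input.

def op_add32(a: int, b: int) -> tuple[int, int]:
    """32-bit addition returning (sum, signed-overflow flag): a valid operation."""
    s = (a + b) & 0xFFFFFFFF
    ovf = (((a ^ s) & (b ^ s)) >> 31) & 1    # operands agree in sign, result differs
    return s, ovf

def xchg(a: int, b: int) -> tuple[int, int]:
    """NOT a valid operation: its dependence graph is not complete bipartite."""
    return b, a

assert op_add32(0x7FFFFFFF, 1) == (0x80000000, 1)   # signed overflow
assert op_add32(1, 2) == (3, 0)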
Temporary variables
Since Pivot 2 is a binary code representation, it has no types and works only with bit vectors of various lengths. All temporary variables are in the SSA form, which implies that the size of the variable is determined once when it is defined. Temporary variables are numbered beginning with 1. These numbers are local within the fragment, a structural element of the IR (see Section 5.2.6). The temporary variable with number 0 has a special interpretation: it is always regarded as a bit vector of a single zero bit.
A set of temporary variables with sequential numbers forms a group, which can be defined as a half-interval of the numbers of the variables included in it. The concept of the group is used below when discussing operators and data transfer between basic blocks.
Operators
An operator is an abstraction of one indivisible action within the IR. The set of operators in Pivot 2 is fixed and does not depend on specific characteristics of the instruction set. Operators receive a set of numbers for temporary variables (possibly, empty one) as an input and yield a group of temporary variables (possibly, empty one) as an output. Unlike operations, some operators have side effects, but they are fixed for each operator type. Operators carry out data moves (MIX, EXTRACT, CONCAT, and INIT), perform certain actions (INVOKE and CALL), or describe access to storage devices of the target machine (LOAD.L, LOAD.R, STORE.L, and STORE.R).
The MIX operator is used to group variables: input temporary variables are sequentially copied into variables of an output group. For example, the MIX operator that receives the set {1, 7, 2} as an input and yields a group beginning with number 10 as an output copies variable 1 to variable 10, variable 7 to variable 11, and variable 2 to variable 12. In the boundary case where the set of variables is empty, the MIX operator degenerates into an empty NOP operator.
The EXTRACT operator selects a part of the bit vector in its single input variable to form an individual output variable. The half-interval of bit indices that bounds the field to be extracted is a constant parameter of the operator; the indices cannot depend on temporary variables. This operator enables more accurate taint analysis, even when the semantics of individual operations is not taken into account.
The CONCAT operator joins several bit vectors into one. The size of the output variable is determined as the total size of the input variables.
The INIT operator puts a constant bit vector into the variable.
The INVOKE operator applies a specified (by an index in the operation table) operation to a set of variables to yield an output group corresponding to the result of this operation.
The CALL operator calls an IR fragment (specified by an index in the fragment table) as a subroutine. Like the INVOKE operator, it receives a set of input variables to form an output group. For interpretation, a call stack is supported whose elements are pairs (index of the caller basic block, number of the operator in the block). Thus, when the exit node of the callee fragment is reached, control is returned to the operator that comes after CALL.
The LOAD.L operator describes reading from a local address space. For this operator, the index of the address space in the table, the number of the variable containing the address, and the size of the value to be read are specified. In addition, the byte order (little endian or big endian) is set; this is necessary because, even within one instruction set, there may be instructions that interpret byte order in memory differently. The read result is saved in the specified output variable.
The STORE.L operator describes writing to a local address space. For this operator, the index of the address space, number of the variable containing the address, number of the variable containing the value to be written, and byte order are specified.
The LOAD.R operator is used to read from a remote address space. Recall that reading from a remote address space can throw an exception; therefore, in addition to the attributes specified for LOAD.L, in this case, a reference to an exception handler fragment is also provided. As an input parameter, this fragment receives a bit vector with an error descriptor, the size of which is set for each remote address space.
The STORE.R operator implements writing to a remote address space and differs from STORE.L in the same way as LOAD.R differs from LOAD.L.
The operators LOAD.R and STORE.R can be annotated with the PROBE flag, which changes their behavior as follows: the access itself is not performed, but its admissibility is checked. If the access can be carried out successfully, execution jumps to the next operator; otherwise, it jumps to the error handler.
Thus, there are ten operators in total, each of which has a clear behavior and can be analyzed quite easily, which was one of the goals when designing this IR.
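To make the operator semantics concrete, the following sketch (our own encoding of execution frames as dictionaries from variable numbers to values; not the actual implementation) interprets the MIX example given above:

def mix(frame: dict[int, int], inputs: list[int], out_start: int) -> None:
    """Copy the input variables, in order, into the group starting at out_start."""
    for k, src in enumerate(inputs):
        frame[out_start + k] = frame[src]

frame = {0: 0, 1: 0xAA, 2: 0xCC, 7: 0xBB}   # variable 0 is the constant zero bit
mix(frame, [1, 7, 2], 10)                   # variables 10, 11, 12 := 1, 7, 2
assert (frame[10], frame[11], frame[12]) == (0xAA, 0xBB, 0xCC)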
Basic blocks
A sequence of operators that is contiguous in terms of control flow forms a basic block. In addition to this sequence, a basic block specifies an input group of local variables and an output group. When control is transferred from one basic block to another, the variables of the output group of the first block are copied into the variables of the input group of the second one. Thus, if two blocks are connected by an edge, the sizes of these groups must match. This approach avoids the explicit introduction of φ-functions, replacing them with simple copy actions performed on the edges (a small sketch below illustrates these edge copies).
To facilitate work with the IR, all basic blocks formally have two outgoing edges, a "false" one and a "true" one, as well as a dedicated temporary variable that controls branching. This control variable must always have a size of 1 bit and be defined in the current block or in one of its dominators. In the case where a basic block has only one outgoing edge (an unconditional branch), the special variable with number 0 (which always has the "false" value) is used as the control variable. If a basic block has more than two successors, it is replaced by a cascade of two-way branches.
All basic blocks analyzed within a certain task are combined into a basic block table that assigns an individual index to each block. Thus, references to successive blocks are stored in the form of indices.
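The edge-copy rule that replaces φ-functions can be sketched in the same style (the structures are illustrative): when control moves along an edge, the predecessor's output group is copied into the successor's input group, with groups given as half-intervals of variable numbers.

def take_edge(frame: dict[int, int],
              pred_out: tuple[int, int], succ_in: tuple[int, int]) -> None:
    """Copy the predecessor's output group into the successor's input group."""
    (p_lo, p_hi), (s_lo, s_hi) = pred_out, succ_in
    assert p_hi - p_lo == s_hi - s_lo, "connected blocks must have matching groups"
    for k in range(p_hi - p_lo):
        frame[s_lo + k] = frame[p_lo + k]

frame = {20: 0xDEAD, 21: 0xBEEF}
take_edge(frame, (20, 22), (5, 7))          # variables 5, 6 := variables 20, 21
assert (frame[5], frame[6]) == (0xDEAD, 0xBEEF)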
Fragments
A fragment is the largest structural unit of the IR. The fragment is basically a directed graph with a single entry and a single exit, i.e., a hammock. The nodes in this graph are basic blocks, while the edges correspond to possible branches. Thus, the fragment can be regarded as a subroutine.
An entry basic block can have a non-empty input group. In this case, the corresponding variables should be defined before working with the fragment; these variables are its input parameters. Similarly, the variables from the output group of an exit basic block are the result of the fragment’s operation. Thus, fragments can call one another by using the CALL operator or can be employed as exception handlers in LOAD.R and STORE.R.
Like the other IR components, fragments are combined into a fragment table and are assigned indices for cross-references.
Contexts
All tables mentioned above (the address space table, operation table, basic block table, and fragment table) together form a context. Subsequent analysis is always carried out within a certain context.
CODE ANALYSIS BASED ON A DECLARATIVE PROCESSOR MODEL
A general solution that enables declarative processor model specification and serves as a basis for many types of binary code analysis can be obtained by combining the abstract interpretation approach [19] with the ideas proposed in Sections 4.2 and 5.2.
Let us first consider the functional blocks that the processor model should provide.
1. Decoding the binary code of the target processor architecture. Depending on the particular task, the source of binary code can be the static image of a compiled program, a memory snapshot, or a sequence of instructions executed in the process of dynamic analysis. This block yields a structure that describes the decoded instruction: in particular, its length, mnemonic, and set of operands (including addressing modes and particular register numbers, addresses, offsets, etc.).
2. Translating the decoded instruction or block of several instructions into the IR described in Section 5.2. For this purpose, it is required to use external specifications of processors that associate Pivot fragments with machine instructions. As a result, we obtain the Pivot code that describes the behavior of one or several instructions.
3. Optimizing the Pivot code. With a machine instruction being the basic translation unit, redundant computations (in particular, computation of operand addresses) are inevitable when composing fragments. To reduce the size of the IR, the following optimizations are carried out within the fragment:
3.1. constant folding and propagation;
3.2. common subexpression elimination;
3.3. elimination of redundant LOAD.L and STORE.L operators (in this case, LOAD.R and STORE.R, whose logic can vary in different analysis scenarios, generally cannot be excluded).
4. For some analysis scenarios, it is also required to simulate the machine’s behavior between instructions, namely, instruction fetching logic, exception handling, etc. This is necessary for full-system analysis, as well as for more accurate generation of the control flow graph (CFG) when the target architecture supports hardware loops, etc.
Thus, the declarative processor model consists of the following components:
• the instruction decoder as described in Section 4.2;
• the set of Pivot fragments that define the operational semantics of individual machine instructions, as well as the table that matches these fragments against decoding results;
• the Pivot code fragment that describes the processor’s exception handling logic and instruction fetching logic, i.e., the actions that are performed between the instructions.
Given a database of declarative processor models, binary code of various architectures can be processed in a unified way by implementing abstract interpretation on top of the IR.
Let us fix a lattice of abstract states L and define the following two transfer functions:
• the transfer function for basic blocks (TB) that maps the abstract state at the block’s input to an output state;
• the transfer function for edges (TE) that maps the abstract state at the edge’s input to an output state.
It should be noted that these definitions imply a problem with forward propagation of the state. In a problem with backward traversal of CFG edges, symmetric definitions are used. Note also that, instead of the function TB, the function TS, which operates on an individual IR operator, can be defined. In this case, the function TB is constructed as a composition of TS for the operators that constitute the basic block.
Taken together, the lattice, transfer functions, and initial state at the entry node allow us to formulate the data flow analysis problem (and the corresponding abstract interpretation) for a Pivot fragment. The algorithm that yields the MFP solution to this problem does not depend on the problem statement and is universal. Moreover, with this method of defining the transfer functions, the algorithm can be implemented in two ways whereby states are stored for basic blocks (the classic version described, e.g., in [28]) or for edges.
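A compact sketch of the block-based variant of this algorithm in Python (the parameter names are ours; the lattice is supplied through join and bottom, and the transfer function TB through tb):

def solve_mfp(cfg, entry, init, bottom, join, tb):
    """Worklist iteration to the MFP solution; cfg maps a block to its successors."""
    state_in = {block: bottom for block in cfg}
    state_in[entry] = init
    worklist = [entry]
    while worklist:
        block = worklist.pop()
        out = tb(block, state_in[block])        # propagate the state through TB
        for succ in cfg[block]:
            merged = join(state_in[succ], out)  # combine states at the successor
            if merged != state_in[succ]:        # changed: reprocess the successor
                state_in[succ] = merged
                worklist.append(succ)
    return state_in

# Toy instance: collect, for every block, the set of blocks on paths leading to it.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
result = solve_mfp(cfg, "A", init=frozenset(), bottom=frozenset(),
                   join=lambda a, b: a | b,
                   tb=lambda block, state: state | {block})
assert result["D"] == {"A", "B", "C"}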
Basically, this algorithm describes the universal approach to static analysis. Depending on the lattice and transfer functions selected, it can be used for evaluation of variables, static taint analysis, bug detection, etc. In addition, based on this algorithm, the optimizations described above (which eliminate artifacts of binary code translation into the IR) can be implemented.
Symbolic execution can be implemented in a similar way. Since symbolic execution implies analyzing each individual path (and, ideally, generating the MOP solution to the data flow analysis problem) while considering each executed operator in the framework of certain abstract interpretation, the data flow analysis problem formulated above is relevant in this case as well. The main difference is the need for a scheduler to select the next path upon reaching a branch in the program.
Finally, dynamic analysis can be regarded as a simplified version of symbolic execution whereby, for each branch, the control transfer target is known. Thus, dynamic analysis is a special case of abstract interpretation and can be carried out using the constructs described above. The problem of concrete interpretation should be especially noted: it is easy to see that, by identifying concrete states with abstract states and selecting a proper type of the lattice L, concrete interpretation can be carried out based on the abstract one.
We can also add that, since the Cartesian product of lattices is itself a lattice, several aspects of program behavior can be analyzed in one run. In particular, this is of interest for dynamic analysis (e.g., concrete execution can be accompanied by analysis of accesses to uninitialized memory, analysis of synchronization primitives, etc.), as well as for concolic execution, whereby one part of a program or software system undergoes symbolic execution while another part undergoes concrete execution.
CONCLUSIONS
In this paper, we have systematized binary code analysis problems and approaches. Taking into account the specific features of modern processor architectures and the experience of presently available tools for decoding machine instructions and describing their operational semantics, a new intermediate representation (IR) for binary code analysis has been proposed. It has been shown that this IR can be consistently applied to various problems of static and dynamic analysis of binary code, as well as symbolic execution.
Our further work will include the implementation of the proposed ideas as a set of open source libraries and the construction of a database for declarative descriptions of popular processors.
FUNDING
This work was supported by the Russian Foundation for Basic Research, project no. 18-07-01256 A.
Translated by Yu. Kornienko
REFERENCES
1. Wang, X., Zeldovich, N., Kaashoek, M.F., and Solar-Lezama, A., A differential approach to undefined behavior detection, ACM Trans. Comput. Syst., 2015, vol. 33.
2. Nethercote, N. and Seward, J., Valgrind: A framework for heavyweight dynamic binary instrumentation, ACM SIGPLAN Not., 2007, vol. 42, pp. 89–100. doi:10.1145/1273442.1250746
3. Chipounov, V. and Candea, G., Enabling sophisticated analyses of x86 binaries with RevGen, Proc. IEEE/IFIP 41st Int. Conf. Dependable Systems and Networks Workshops (DSN-W), 2011, pp. 211–216.
4. Lattner, C. and Adve, V., LLVM: A compilation framework for lifelong program analysis and transformation, Proc. Int. Symp. Code Generation and Optimization: Feedback-Directed and Runtime Optimization, 2004, pp. 75–86.
5. Song, D., Brumley, D., Yin, H., Caballero, J., Jager, I., Kang, M.G., Liang, Z., Newsome, J., Poosankam, P., and Saxena, P., BitBlaze: A new approach to computer security via binary analysis, Inf. Syst. Secur., 2008, pp. 1–25.
6. Padaryan, V.A., Solov'ev, M.A., and Kononov, A.I., Simulation of operational semantics of machine instructions, Program. Comput. Software, 2011, vol. 37, pp. 161–170. doi:10.1134/S0361768811030030
7. Brumley, D., Jager, I., Avgerinos, T., and Schwartz, E.J., BAP: A binary analysis platform, 2011.
8. Dullien, T. and Porst, S., REIL: A platform-independent intermediate representation of disassembled code for static code analysis, Proc. CanSecWest Conf., 2009.
9. Bellard, F., QEMU: A fast and portable dynamic translator, Proc. USENIX Annual Technical Conf., 2005.
10. Luk, C.K., Cohn, R., Muth, R., Patil, H., Klauser, A., Lowney, G., Wallace, S., Reddi, V.J., and Hazelwood, K., Pin: Building customized program analysis tools with dynamic instrumentation, ACM SIGPLAN Not., 2005, vol. 40, pp. 190–200. doi:10.1145/1064978.1065034
11. Bruening, D. and Amarasinghe, S., Efficient, transparent, and comprehensive runtime code manipulation, PhD Thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 2004.
12. Chipounov, V. and Kuznetsov, V., S2E: A platform for in vivo multi-path analysis of software systems, Proc. 16th Int. Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2011.
13. Cha, S.K., Avgerinos, T., Rebert, A., and Brumley, D., Unleashing mayhem on binary code, Proc. IEEE Symp. Security and Privacy (SP), 2012, pp. 380–394.
14. Padaryan, V.A., Kaushan, V.V., and Fedotov, A.N., Automated exploit generation method for stack buffer overflow vulnerabilities, Tr. Inst. Sistemnogo Program. Ross. Akad. Nauk, 2014, vol. 26, pp. 127–144.
15. Kruegel, C., Valeur, F., Robertson, W., and Vigna, G., Static analysis of obfuscated binaries, Proc. 13th USENIX Security Symp., 2004, pp. 255–270.
16. Ben Khadra, M.A., Stoffel, D., and Kunz, W., Speculative disassembly of binary code, Proc. Int. Conf. Compilers, Architectures, and Synthesis for Embedded Systems, 2016.
17. Balakrishnan, G. and Reps, T., Analyzing memory accesses in x86 executables, Proc. 13th Int. Conf. Compiler Construction, 2004, pp. 5–23.
18. Aslanyan, H., Asryan, S., Hakobyan, J., Vardanyan, V., Sargsyan, S., and Kurmangaleev, S., Multiplatform static analysis framework for program defects detection, Proc. CSIT Conf., 2017.
19. Cousot, P. and Cousot, R., Abstract interpretation: A unified lattice model for static analysis of programs by construction or approximation of fixpoints, Proc. 4th ACM SIGACT-SIGPLAN Symp. Principles of Programming Languages, 1977, pp. 238–252.
20. Padaryan, V.A., Getman, A.I., Solovyev, M.A., Bakulin, M.G., Borzilov, A.I., Kaushan, V.V., Ledovskikh, I.N., Markin, Yu.V., and Panasenko, S.S., Methods and software tools to support combined binary code analysis, Program. Comput. Software, 2014, vol. 40, pp. 276–287. doi:10.1134/S0361768814050077
21. GNU Binutils. http://www.sourceware.org/binutils. Accessed December 3, 2018.
22. Capstone. http://www.capstone-engine.org. Accessed December 3, 2018.
23. IDA Pro. http://www.hex-rays.com/products/ida/index.shtml. Accessed December 3, 2018.
24. Fauth, A., Van Praet, J., and Freericks, M., Describing instruction set processors using nML, Proc. European Design and Test Conf., 1995, pp. 503–507.
25. Hadjiyiannis, G., Hanono, S., and Devadas, S., ISDL: An instruction set description language for retargetability, Proc. 34th Annual Design Automation Conf., 1997, pp. 299–302.
26. Fox, A., Improved tool support for machine-code decompilation in HOL4, Proc. Int. Conf. Interactive Theorem Proving, 2015, pp. 187–202.
27. Gray, K.E., Kerneis, G., Mulligan, D., Pulte, C., Sarkar, S., and Sewell, P., An integrated concurrency and core-ISA architectural envelope definition, and test oracle, for IBM POWER multiprocessors, Proc. 48th Int. Symp. Microarchitecture, 2015, pp. 635–646.
28. Muchnick, S.S., Advanced Compiler Design and Implementation, San Francisco: Morgan Kaufmann, 1997.
© Pleiades Publishing, Ltd. 2019.