Content area
The illegal copying of Windows applications is a significant problem, because MS-Windows is the most popular operating system in Korea. A software birthmark technique is one of the solutions for protecting software copyright and can distinguish software piracy using feature information from the software. The types of software birthmarks can be categorized into static birthmark and dynamic birthmark by the extraction method. In this paper, we propose a similarity measurement technique using dynamic birthmark based on an application program interface (API), and we design and implement the extraction process of the dynamic birthmark. Our experimental results show that the proposed measurement technique is better than the existing measurement technique.
Abstract
The illegal copying of Windows applications is a significant problem, because MS-Windows is the most popular operating system in Korea. A software birthmark technique is one of the solutions for protecting software copyright and can distinguish software piracy using feature information from the software. The types of software birthmarks can be categorized into static birthmark and dynamic birthmark by the extraction method. In this paper, we propose a similarity measurement technique using dynamic birthmark based on an application program interface (API), and we design and implement the extraction process of the dynamic birthmark. Our experimental results show that the proposed measurement technique is better than the existing measurement technique.
Key Words: Software Birthmark, Similarity Measurement, Source Code Piracy
(ProQuest: ... denotes formulae omitted.)
1. Introduction
Microsoft Windows has a high market share in the domestic operating system market. Additionally, most software markets are of Windows software, which is why the illegal copying of Windows software has become such a serious problem. The piracy of source code can cause damage to the company in terms of both monetary cost and image.
Copyright protection techniques have been studied in various areas to reduce such damage and protect the copyright. Typical copyright protection technology includes Software Birthmark. Software birthmarks distinguish between static birthmark and dynamic birthmark.
The measurement of similarity is necessary to identify the source code piracy. The source code is the company's key asset, which is why most companies do not want it to be disclosed. Thus, we need to be able to compare the similarity between software without the source code. In this paper, we measure similarity between software using Dynamic Birthmark, which enables measurement of similarity without the source code.
The remainder of this paper is organized as follows. Chapter 2 discusses the related work regarding software birthmarks. Chapter 3 explains the extraction method of dynamic birthmark and the proposed similarity measurement technique. In chapter 4, we evaluate through experiment the best grouping value and performance of various similarity measurement techniques. Chapter 5 provides the conclusion and future research.
2. RELATED WORK
2.1 Software Birthmark
Software Birthmark is used to compare the similarity; it is created by extracting the feature information of the software. Software Birthmark is distinguished from Static Birthmark and Dynamic Birthmark by the method of extracting the feature information.
Static Birthmark is a method for extracting a software birthmark by using the obtained information from the binary file or source code without executing the target software. Dynamic Birthmark is a method for extracting a software birthmark using the extracted information through the software execution. Although Dynamic Birthmark can be changed according to the execution environment or the input value, the resilience of the code obfuscation is better than that of Static Birthmark. The feature information of Dynamic Birthmark can use information such as the function call graph, the memoiy state change, and windows API.
2.2 Similarity Measurement technique of Software Birthmark
2.2.1 K-gram Similarity measurement technique
K-gram is the contiguous command set of length k in software. When K-gram that is extracted from software r and s assumes the f(r) and f(s), the similarity between the f(r) and f(s) is defined by:
... (2.1)
|f(r)| refers to the number of elements that do not overlap in f(r).
2.2.2 Cosine Similarity measurement technique
This technique is used in various fields. Cosine similarity is a measure of similarity between two vectors of an inner space. The similarity between x and y is defined by:
... (2.2)
2.2.3 The need for research
When the dynamic birthmark is extracted from two software that have similar functions, it can measure high similarity because it uses similar functions. Although two software are the same, Dynamic Birthmark can create different birthmarks by using scheduling, multi thread, and graphical user interface (GUI) noise. This has a stronger effect on a birthmark based on the call sequence than on a birthmark based on call frequency.
Table 1. compares Cosine similarity and K-gram similarity using the Dynamic Birthmark technique. As can be seen from the results of this study, different results are obtained when using K-gram similarity from when using Cosine similarity. In this paper, we propose a similarity measurement technique that uses both a calling frequency and a sequence.
3. Dynamic Birthmark based on API
This section describes the extraction method of dynamic birthmark and the proposed similarity measurement technique.
Fig. 1. illustrates the process of extraction of dynamic birthmarks and the proposed similarity measurement technique. While running the windows software using the same function, we extract the dynamic birthmark based on API. The matched lists and mismatched lists are distinguished using the Diff. We then measure the similarity between the matched lists and mismatched lists. We then calculate the final similarity by adding the similarity values of matched lists and mismatched lists.
3.1 Extraction of Dynamic Birthmark based on API
The dynamic birthmark used in this paper is extracted using the extracted information while running the windows software using the same function. The Windows API is an essential call to run the Windows software; it is thus very difficult to change the API call to other calls. Therefore, we determine that Windows API is important and we extract the dynamic birthmark based on Windows API.
Detours Library can extract 1667 API, which includes the essential Windows software API such as User32.dll, GDI32.dll, KERNEL32.dll, and WS2_32.dll. It is therefore possible to extract a valid dynamic birthmark.
3.1 Similarity measurement of Dynamic Birthmark based on API
The process of the proposed similarity measurement involves four steps. In the first step, the matched lists and mismatched lists are distinguished using the Diff. The next step then calculates the similarity between the matched lists and mismatched lists. The final step calculates the final similarity by adding the similarity values of the matched lists and mismatched lists.
The matched lists and mismatched lists are distinguished using the Diff. The matched lists are the same parts of the call that are shown to exactly match by comparing the software. We assume that s£, is the matched lists of software P and P', where n( ) refers to the number of elements. The similarity of matched lists is defined by:
... (3.1)
The similarity measurement technique of the mismatched lists separates the length r from the mismatched lists, and we then measure the similarity of the separated mismatched lists using cosine similarity. The entire similarity of mismatched lists is calculated by adding all the similarity values of each separated mismatched list. We assume that 0fr)= idvd2,.. dr)n * * {dr+"dr+2, . . dx)f) , D,{P\ = (dvd2,d3.. d)is the mismatched lists of software P, the vector set of D,{p\ is Di{p) = ((¿/,,k)_(dr,/)) * The k, j, and 1 are the frequencies of each element in D,(p\- The similarity of mismatched lists is defined by: (only, >
... (3.2)
The final similarity of software is calculated by adding the similarity values of the matched lists and mismatched lists.
4. Experiment and Evaluation
4.1 Experiment method and composition
In this chapter, we conduct an experiment to evaluate the proposed similarity measurement technique. The aim of the first experiment is to find the optimal grouping value that is used to measure the similarity of mismatched lists. The second experiment evaluates whether the proposed similarity measurement technique is suitable for measuring the similarity.
4.2 Find the optimum grouping value
The aim of this experiment is to find the optimal grouping value(r) that is used to measure the similarity of the mismatched lists. Grouping of the mismatched lists is carried out so the call sequence can be changed by thread or scheduler. Even if the calling sequence is changed, the call is likely to be in a similar location. For this reason, the mismatched lists are grouped by as much as the grouping value. We choose grouping values of 500, 1,000, and 2,000 to find the optimal grouping value. The experiments measured the similarity between software while changing the grouping value.
We choose windows software such as FTP Tool, a document editing program, a compression program, a music-playing program, and a video-playing program to evaluate the grouping value. Table 2. shows the categories and names of the software for the experiment.
Dynamic birthmarks can differ accordmg to the executed function in each category. Thus, we define each function that will be executed in the program. We extract the dynamic birthmark by executing the defined function. Moreover, we use the same file in the extraction process for each category. Table 3. shows the executed functions to extract the dynamic birthmark.
Fig. 2. shows a graph that measures the similarity between the same software while changing the grouping value. The higher the similarity, the better the good result in this experiment because the experiment measures similarity between the same software. In the experimental results, the higher the grouping value (r), the higher the similarity. When the grouping value is 2000, the similarity between the same software increases.
Fig. 3. shows a graph that measures the similarity between different software in the same category while changing the grouping value. In this experiment, the lower the similarity, the better the result. In the experimental results, the higher the grouping value (r), the higher the similarity.
The similarity between the same software increased by an average of 4.91%, while the similarity between different software in the same category increased by an average of 3.06%. Although similarity has increased for both the same software and different software in the same category, the optimal grouping value (r) is 2000 because the increase in similarity of the same software is higher than the increase in similarity of different software in the same category.
4.3 Performance comparison with existing similarity measurement techniques
This experiment verifies that the proposed similarity measurement technique is suitable for measuring similarity. Figs. 4 and 5 show graphs that illustrate the measured similarity between the same software using the three similarity measurement techniques. The Cosine similarity finds the most similarity of 90% or more in the two experiments. In spite of the same software, the K-gram similarity is low in the Video-Playing program, Music-Playing program, and Document editing program. This is because the dynamic birthmark may have created different birthmarks due to scheduling, multi thread, and GUI noise. The measured similarity using the proposed technique is lower than the measured similarity using the cosine similarity, but it is higher than the measured similarity using the K-gram similarity.
Fig. 6. shows a graph that illustrates the measured similarity between different software in the same category using the three similarity measurement techniques. The K-gram similarity finds the most similarity of 35% or lower in the experiment. In spite of the different software, the cosine similarity is high in the Video-Playing program, Music-Playing program, and FTP Tool. The measured similarity using the proposed technique is higher than the measured similarity using the K-gram similarity technique, but it is lower than the measured similarity using the cosine similarity.
5. Conclusion and Future Work
We proposed a technique that measures the similarity between software after distinguishing matched parts and mismatched parts. We conducted an experiment to find the optimal grouping value (r). When the grouping value is 2000, the similarity of the identical software and different software in the same category is higher. The optimal grouping value (r) is 2000 throughout the experiment, because the similarity increase of the same software is higher than the similarity increase of different software in the same category. Also, we verified that the technique is suitable for use in measuring similarity through experiments.
When we measure the similarity of the different software in the same category using the proposed similarity measurement technique in this paper, the average similarity is 31.67%. However, the software is developed independently because the dynamic birthmark contained essential windows API. In future work, an extraction technique needs to be developed that removes essential windows API from the dynamic birthmark to improve the accuracy of similarity measurement.
6. Acknowledgments
This research was supported by Basic Science Research Program through the National Research Foundation of Korea(NRF) funded by the Ministry of Education(2016R1D1A1B01016073)
References
[1] Ginger Marie Myles, "Software Theft Detection Through Program Identification", Ph.D. Thesis, Department of Computer Science, The University of Arizona, 2006.
[2] Collberg C. S., and Thomborson C., "Watermarking, tamper- proofing, and obfuscation-tools for software protection", IEEE Transactions on software engineering, Vol. 28, pp. 735-746, August. 2002.
[3] N.R Wanger, "Fingerprinting", IEEE Symposium on Security and Privacy, pp. 18-22, 1983.
[4] Tamada, H., Okamoto, K., Nakamura, M., Monden, A., and Matsumoto, K. I., "Dynamic software birthmarks to detect the theft of windows applications", International Symposium on Future Software Technology, Vol. 2004, October. 2004,
[5] Tamada, H., Nakamura, M., Monden, A., and Matsumoto, K. I., "Java Birthmarks-Detecting the Software Theft-", IEICE transactions on information and systems, Vol. 88, No. 9, pp. 2148-2158, 2005.
[6] Ginger Myles and Christian Collberg, "k-Gram Based Software Birthmarks", In Proceedings of the 2005 ACM Symposium on Applied Computing, pp. 314-318, 2005.
[7] Ricardo B. Y. and Berthier R. N., "Modem Information Retrieval" ACM Press, Vol. 463, 1999.
[8] Galen H. and Doug B., "Detours : binary interception of Win32 functions", In Proceedings of the Third USENIX Windows NT Symposium, pp. 135-143, 1999.
Cherl-hong Ko*, Dae Shin Park*, and Jiman Hong*
* School of Computer Science and Engineering, Soongsil University, Seoul, Republic of Korea
E-mail: [email protected], [email protected], [email protected]
Corresponding author: Jiman Hong, Ph.D.
School of Computer Science and Engineering, Soongsil University, 369 Sangdo-Ro, Dongjak-Gu, Seoul, 156-743, Korea
E-mail: [email protected]
Copyright International Information Institute Nov 2016