PhD: Protecting source code through multi-stage diversification

Keywords: software diversity, source code transformation, static and dynamic analysis, empirical software engineering, statistics

Contact: Benoit Baudry (



This PhD will be in the field of software diversity. Software diversification was initiated by the seminal work of Cohen [1] and Forrest et al. [2], which established the foundations of software diversity at the operating system (OS) and source code levels. These works introduce the importance of randomization in the diversification process, highlighting two families of randomization: randomly adding or deleting non-functional code, and reordering code.
Since then, many works have investigated the opportunities of randomization for code protection. Wang et al. [3] introduce diversity at multiple levels in the control flow so as to provide in-depth obfuscation. Wang et al.'s architecture relies on probing mechanisms that integrate two forms of diversity: in time (the probe algorithms are replaced regularly) and in space (different probing algorithms run on the different nodes of the distributed system). Lin et al. [4] randomize the data structures of C code. Following the line of thought of Forrest et al. [2], they reorder the fields of data structures and insert garbage ones.
Banescu et al. [5] exploit software diversity, along with white-box cryptography, against changeware. Browser hijacking malware is one popular example of changeware: it aims at changing web-browser settings such as the default search engine or the home page.
For an overview of code obfuscation, we refer to the now classical taxonomy by Collberg and colleagues [6]. Our recent literature survey [7] presents a broader overview of software diversity.


The main novelty of this work, with respect to the state of the art, is to investigate diversification techniques that specifically target object-oriented, interpreted languages (e.g., Java or JavaScript).
The motivations for this are manifold: little attention has been paid to these languages in the obfuscation / diversification literature despite their massive presence in all domains; the design of these languages presents characteristics that can be exploited to develop novel types of transformations that are not possible in languages like C; performing multi-stage (source code, bytecode, interpreter) transformations makes it possible to exploit various forms of knowledge that are present only at specific levels.
In the following, we illustrate different forms of diversification that we will investigate during the PhD.

Exploiting polymorphism to synthesize code variants.

The first object-oriented characteristic we want to exploit for diversification is the Liskov substitution principle. This principle states that a variable of type A can be assigned an instance of any subtype of A. Based on this principle and on the fact that programs reuse many external libraries, one possible diversification transformation consists in targeting all variables whose type is defined in an external library and replacing the object they hold with an instance of a valid subtype.
Although this may appear to be a very small change in the source code, it can have a radical impact on the compiled code (which includes the libraries and will have a different control flow, memory layout and performance from the original).
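The substitution described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the transformation that will be developed in the PhD: the class SubtypeDiversification and the methods newList and sumOfSquares are hypothetical names, and java.util.List stands in for a type from an external library, with ArrayList and LinkedList as two valid subtypes.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class SubtypeDiversification {

    // Hypothetical diversification point: the declared type stays List,
    // while the concrete subtype is chosen per variant.
    static List<Integer> newList(int variant) {
        if (variant % 2 == 0) {
            return new ArrayList<>();  // array-backed storage
        }
        return new LinkedList<>();     // node-based storage
    }

    // Client code depends only on the List interface, so its result
    // does not change when the subtype is swapped.
    static int sumOfSquares(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v * v;
        }
        return sum;
    }

    public static void main(String[] args) {
        for (int variant = 0; variant < 2; variant++) {
            List<Integer> values = newList(variant);
            values.add(1);
            values.add(2);
            values.add(3);
            System.out.println(values.getClass().getSimpleName()
                    + " -> " + sumOfSquares(values));
        }
    }
}
```

Both variants compute the same result, but they execute different library code with a different memory layout and different performance characteristics. A real transformation would also have to check that the client code does not rely on behavior specific to one subtype (for example, constant-time random access).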

Conjoint diversification of code and interpreter

Java programs are not compiled directly to machine code. Instead, they are compiled to bytecode, an intermediate representation, which is passed to an interpreter in order to run the program. Hence, the interpreter, i.e., the Java Virtual Machine (JVM), embeds the semantics of the language. For example, the expression a+3 is interpreted by the JVM as the addition of 3 to the value of a. In this project, we want to investigate the opportunity to conjointly transform the bytecode and the interpreter to create diverse versions of a program. For example, we can transform the expression into a-3 and transform the JVM so that the “-” operator has the semantics of addition. Consequently, the transformed expression, interpreted by the transformed JVM, has the same semantics as the original one even though the bytecode is different.
This approach drastically raises the bar for reverse engineering, since it is now necessary to reverse both the interpreter and the program in order to understand how their joint execution behaves.
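The idea can be illustrated on a toy stack machine. The sketch below is not the JVM and does not use real JVM bytecode: the opcodes PUSH, ADD and SUB, the class ConjointDiversification and the methods run and transform are all hypothetical, introduced only to show how a program and its interpreter can be transformed together.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class ConjointDiversification {

    // Canonical opcodes of the toy machine (assumed for this sketch).
    static final int PUSH = 0, ADD = 1, SUB = 2;

    // Interprets code; decode[i] is the canonical meaning that this
    // interpreter variant gives to the encoded opcode i.
    static int run(int[] code, int[] decode) {
        Deque<Integer> stack = new ArrayDeque<>();
        int pc = 0;
        while (pc < code.length) {
            int op = decode[code[pc++]];
            switch (op) {
                case PUSH:
                    stack.push(code[pc++]);  // operand follows the opcode
                    break;
                case ADD: {
                    int right = stack.pop();
                    int left = stack.pop();
                    stack.push(left + right);
                    break;
                }
                case SUB: {
                    int right = stack.pop();
                    int left = stack.pop();
                    stack.push(left - right);
                    break;
                }
                default:
                    throw new IllegalStateException("unknown opcode " + op);
            }
        }
        return stack.pop();
    }

    // Rewrites the opcodes (not the operands) of a program written
    // with canonical opcodes, according to the encode table.
    static int[] transform(int[] code, int[] encode) {
        int[] out = code.clone();
        int pc = 0;
        while (pc < code.length) {
            int op = code[pc];
            out[pc] = encode[op];
            pc += (op == PUSH) ? 2 : 1;  // skip the operand of PUSH
        }
        return out;
    }

    public static void main(String[] args) {
        int[] original = {PUSH, 4, PUSH, 3, ADD};  // a + 3 with a = 4
        int[] identity = {PUSH, ADD, SUB};         // unmodified interpreter
        int[] swapped = {PUSH, SUB, ADD};          // ADD and SUB exchanged

        int[] variant = transform(original, swapped);
        System.out.println(run(original, identity));  // original pair
        System.out.println(run(variant, swapped));    // transformed pair
        System.out.println(run(variant, identity));   // mismatched pair
    }
}
```

The swap table is its own inverse, so the same array serves to encode the program and to decode it in the interpreter. The original pair and the transformed pair yield the same value, whereas running the transformed program on the unmodified interpreter computes a subtraction instead.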

Software diversity measurement.

The last part of the methodology for this PhD will consist in defining sound metrics to quantify the diversity among a set of program variants.
These metrics can consider different aspects that can vary among the variants, for example:

  • control flow: the diversity metric will be a function of the pairwise distances between the control flow graphs of program variants;
  • performance: the diversity metric can be the performance envelope of the set of variants (the bounds within which the measured performance metrics vary);
  • attack reuse: given a specific attacker model, to what extent can the same exploit be reused against all variants.
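One possible reading of these three aspects is sketched below. The concrete choices are assumptions made for illustration, not the metrics that the PhD will define: control flow graphs are reduced to sets of edge labels and compared with the Jaccard distance, the performance envelope is the minimum and maximum of one measurement per variant, and attack reuse is the fraction of variants on which a given exploit succeeds. The class DiversityMetrics and all its method names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DiversityMetrics {

    // Jaccard distance between two edge sets: 0 for identical graphs,
    // 1 for graphs that share no edge.
    static double jaccardDistance(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return 1.0 - (double) intersection.size() / union.size();
    }

    // Control flow diversity as the mean of all pairwise distances.
    static double meanPairwiseDistance(List<Set<String>> graphs) {
        double total = 0.0;
        int pairs = 0;
        for (int i = 0; i < graphs.size(); i++) {
            for (int j = i + 1; j < graphs.size(); j++) {
                total += jaccardDistance(graphs.get(i), graphs.get(j));
                pairs++;
            }
        }
        return pairs == 0 ? 0.0 : total / pairs;
    }

    // Performance envelope {min, max}; expects at least one measurement.
    static double[] performanceEnvelope(double[] measurements) {
        double min = measurements[0];
        double max = measurements[0];
        for (double m : measurements) {
            min = Math.min(min, m);
            max = Math.max(max, m);
        }
        return new double[] {min, max};
    }

    // Attack reuse: fraction of variants on which the exploit succeeded.
    static double attackReuse(boolean[] exploitSucceeded) {
        int hits = 0;
        for (boolean s : exploitSucceeded) {
            if (s) {
                hits++;
            }
        }
        return exploitSucceeded.length == 0 ? 0.0 : (double) hits / exploitSucceeded.length;
    }

    public static void main(String[] args) {
        List<Set<String>> graphs = List.of(
                Set.of("entry->a", "a->exit"),
                Set.of("entry->a", "a->b", "b->exit"),
                Set.of("entry->a", "a->exit"));
        System.out.println(meanPairwiseDistance(graphs));

        double[] envelope = performanceEnvelope(new double[] {12.0, 9.5, 15.0});
        System.out.println(envelope[0] + " .. " + envelope[1]);

        System.out.println(attackReuse(new boolean[] {true, false, false, true}));
    }
}
```

A mean pairwise distance close to 0 indicates that the variants are nearly identical, while an attack reuse value close to 1 indicates that diversification did not stop the exploit. Graph edit distance or other graph similarity measures could replace the Jaccard distance without changing the overall structure.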


[1] F. B. Cohen. Operating system protection through program evolution. Computers & Security, 12(6):565–584, 1993.
[2] S. Forrest, A. Somayaji, and D. Ackley. Building diverse computer systems. In Proc. of HotOS, pages 67–72, 1997.
[3] C. Wang, J. Davidson, J. Hill, and J. C. Knight. Protection of software-based survivability mechanisms. In Proc. of DSN, pages 193–202, 2001.
[4] Z. Lin, R. D. Riley, and D. Xu. Polymorphing software by randomizing data structure layout. In Detection of Intrusions and Malware, and Vulnerability Assessment, pages 107–126. Springer, 2009.
[5] S. Banescu, A. Pretschner, D. Battré, S. Cazzulani, R. Shield, and G. Thompson. Software-based protection against changeware. In Proc. of CODASPY ’15, pages 231–242, 2015.
[6] C. Collberg, C. Thomborson, and D. Low. A taxonomy of obfuscating transformations. Technical report, Department of Computer Science, The University of Auckland, New Zealand, 1997.
[7] B. Baudry and M. Monperrus. The Multiple Facets of Software Diversity: Recent Developments in Year 2000 and Beyond. ACM Computing Surveys, 48(Accepted for publication):16:1–16:26, Sept. 2015.

Working Environment

The candidate will work at INRIA in the DIVERSE team (workplace: Université Rennes 1, Campus de Beaulieu, 35000 Rennes, France). DIVERSE’s research is in the area of software engineering, focusing on the management of diversity in the construction of software-intensive systems. The team is actively involved in European, French and industrial projects and is composed of 8 faculty members, 18 PhD students, 5 postdocs and 4 engineers.
The contract is for 36 months. The monthly net salary is around 1800 euros.