Research topics for interns, Masters or thesis

Software diversification

Polymorphing GraphQL queries
Diverse Multi-compilation for Trusting trust
Java – Kotlin translation to diversify bytecode
Automatic generation of 1 Million libc
Harnessing the natural redundancies in fork-based development for automatic diversity
Instruction set randomization for WebAssembly

Software supply chain


Diversifying a npm registry
Automatic specialization of the JRE
Systematic decompilation in the CI to mitigate supply chain attacks
Detecting superfluous conflicts in Java projects
Visually exploring data structure choices

Software testing

NumPy: test consolidation with Descartes
Automatic synthesis of Mock objects based on Production observations
Full coverage: analyzing the effectiveness of diverse coverage tools for Java
Amplifying library test suites with client usages
Modular: Analyzing test suites in multi-module Maven projects
A journey in modern testing suites on Github

Off the beaten track

Github repositories with literary references
The anatomy of the most Enterprise email client
Easter egg VM flag

Remixes

Remix Descartes in Javascript
Web stalker remix: deconstructing modern browser technology
Remix neural decompilation for JVM bytecode

Software diversification

Polymorphing GraphQL queries

Supervisors: Benoit Baudry, Martin Monperrus, KTH Royal Institute of Technology

GraphQL is increasingly adopted for web APIs [1], making it a good target for exploits [2]. In this work, we investigate polymorphing to harden GraphQL APIs [3]. The student will develop a randomization scheme for the API and the corresponding adaptation of the client queries in order to build an effective protection against injection attacks.
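
The randomization scheme and the matching client-side adaptation can be illustrated with a toy example. This is only a sketch under strong assumptions: the schema is reduced to a set of field names, the per-deployment secret and the alias format are hypothetical, and queries are rewritten with a regular expression, whereas a real scheme would work on the parsed schema and query.

```python
import hashlib
import hmac
import re


def field_mapping(field_names, secret):
    """Derive one alias per schema field from a secret (hypothetical scheme)."""
    mapping = {}
    for name in sorted(field_names):
        digest = hmac.new(secret, name.encode(), hashlib.sha256).hexdigest()
        mapping[name] = "f_" + digest[:10]
    return mapping


def rewrite_query(query, mapping):
    """Replace every identifier found in the mapping; leave the rest untouched."""
    def swap(match):
        token = match.group(0)
        return mapping.get(token, token)
    return re.sub(r"[A-Za-z_][A-Za-z0-9_]*", swap, query)


# Toy schema and secret, both invented for this illustration.
schema_fields = {"user", "name", "email"}
mapping = field_mapping(schema_fields, secret=b"per-deployment-secret")

client_query = "{ user(id: 1) { name email } }"
polymorphic_query = rewrite_query(client_query, mapping)  # what a legitimate client sends

# The server inverts the mapping before resolving the query.
inverse = {alias: name for name, alias in mapping.items()}
restored = rewrite_query(polymorphic_query, inverse)
```

A query written against the original field names no longer matches the randomized API, which is the property the project would need to evaluate against actual injection attacks.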

Diverse Multi-compilation for Trusting trust

Supervisors: Benoit Baudry, Martin Monperrus, KTH Royal Institute of Technology

The problem of deceptive compilers introducing malicious code is relevant and hard [1,2]. One solution is to use multiple diverse compilers to mitigate the problem [3]. For instance, one can compile a C program with both GCC and Clang. You will design, implement and evaluate a multi-compiler scheme for C.

Automatic generation of 1 Million libc

libc is at the core of most software stacks, but it is fragile and prone to critical vulnerabilities [1]. In this work we explore a combination of techniques to generate large amounts of diverse implementations of libc [2]. The student will combine the abundant combinations of flags of C compilers [3] with state of the art code transformation and obfuscation techniques [4] to generate many libc variants.
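
To give a sense of how compiler flags multiply, the sketch below enumerates build configurations from a handful of options. The flags are existing GCC options, but the selection is arbitrary and only meant as an illustration; whether two configurations actually yield different binaries has to be checked empirically.

```python
import itertools

# Existing GCC options; which ones to vary is an assumption of this sketch.
OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3", "-Os"]
TOGGLES = [
    ("-fomit-frame-pointer", "-fno-omit-frame-pointer"),
    ("-fstack-protector-strong", "-fno-stack-protector"),
    ("-funroll-loops", "-fno-unroll-loops"),
]


def build_configurations():
    """Yield one optimization level combined with one choice per toggle."""
    for level, *choices in itertools.product(OPT_LEVELS, *TOGGLES):
        yield [level, *choices]


configs = list(build_configurations())
print(len(configs))  # 5 * 2 * 2 * 2 = 40
```

The count doubles with every additional independent on/off flag, so twenty such flags alone would give 2^20 = 1,048,576 configurations, before any code transformation is applied.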

Java – Kotlin translation to diversify bytecode

Supervisors: Benoit Baudry, Martin Monperrus, KTH Royal Institute of Technology

The transition from Java to Kotlin is a timely and hard problem [1].
In this work, we explore the natural diversity of translation strategies from Java to Kotlin [2], as well as the diversity of compilation options of kotlinc [3] and javac [4]. The goal is to assess the ability of these strategies to generate diverse versions of Java bytecode for the same piece of source code.

Harnessing the natural redundancies in fork-based development for automatic diversity

Recent work has demonstrated the presence of redundant implementation efforts in a fork-based development process [1]. While this might appear as wasted effort, in this work we see this as a unique opportunity for software evolution. The presence of different solutions that address the same feature is key for the robustness of natural systems. In this project we investigate the opportunity to extract redundant implementations from various forks and use them to build diverse versions of a project.

Instruction set randomization for WebAssembly

WebAssembly has recently been proposed as a lightweight bytecode language for the web. It has been designed with high performance in mind. Its instruction set and semantics have been specified as part of a collaboration between the major browser vendors. Since the proposal in 2017, its adoption has been growing rapidly. With this success, security requirements increase.

In this project we explore the adaptation of Instruction Set Randomization for WebAssembly. The idea is to generate a randomized WebAssembly instruction set for each browser and automatically rewrite WebAssembly programs that are launched in this browser. This technology aims at mitigating code injection and code reuse attacks.
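
The principle can be sketched on a toy instruction stream. The example assumes a hypothetical per-browser key, treats a program as a sequence of single-byte opcodes without immediates (so it is not a valid WebAssembly module), and uses Python's non-cryptographic random generator purely for illustration.

```python
import random


def make_permutation(browser_key):
    """Derive a permutation of the 256 byte values from a per-browser key."""
    table = list(range(256))
    random.Random(browser_key).shuffle(table)  # not cryptographically secure
    return table


def invert(table):
    """Compute the inverse permutation, used by the modified engine."""
    inverse = [0] * 256
    for original, randomized in enumerate(table):
        inverse[randomized] = original
    return inverse


def rewrite(opcodes, table):
    """Map every opcode byte through the given table."""
    return bytes(table[b] for b in opcodes)


# Toy opcode stream: i32.add (0x6A), i32.mul (0x6C), drop (0x1A), end (0x0B).
program = bytes([0x6A, 0x6C, 0x1A, 0x0B])

table = make_permutation(browser_key=2017)
randomized = rewrite(program, table)            # rewriting when the program is launched
restored = rewrite(randomized, invert(table))   # decoding inside the browser's engine
```

Code injected without knowledge of the table would be decoded through the inverse mapping into different opcodes, which is the effect the mitigation relies on. Handling immediates, validation and performance are the open parts of the project.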

Software supply chain

Diversifying a npm registry

Supervisors: Benoit Baudry, Martin Monperrus, KTH Royal Institute of Technology

Dependency confusion is a growing threat for the software supply chain [1]. This attack consists in uploading malicious packages on public repositories, which will eventually be packaged in applications through dependency resolution mechanisms. In this work, we will explore the automatic randomization of instructions [3] in private npm registries to mitigate dependency confusion [2]. The student will deploy a local npm registry and an instruction randomization scheme, along with the adaptation of the JavaScript engine to correctly execute the randomized packages.

Automatic specialization of the JRE

The Java Runtime Environment (JRE) is a great, general purpose execution engine, which provides the standard Java libraries.
Because it is general purpose, it offers far more functionality than any single Java application running in the JRE needs.
You will design and experiment with a system that automatically specializes the JRE for a specific Java application, using jcov [1] to identify the parts that are necessary and the parts that can be removed. This topic contributes to hardening the software supply chain through debloating [2] and specialization of the software stack [3].

Systematic decompilation in the CI to mitigate supply chain attacks

Supervisors: Benoit Baudry, Martin Monperrus, KTH Royal Institute of Technology

Supply chain attacks [1] represent a growing threat to software systems, as illustrated by the SolarWinds attack in late 2020 [2]. One of these attacks consists in tampering with the code at one point in the automatic build pipeline, in order to inject malicious code into the binary. In this work, we investigate the systematic disassembly of the binary [3], at the end of the build pipeline, to detect the injection of malicious code.

Detecting superfluous conflicts in Java projects

Build systems and package managers encourage and support reuse of third-party libraries through a clean declaration of third-party dependencies and an automated build process. Such systems include npm, Maven and Cargo. While reuse and automation are strong software engineering principles, the current use of build systems has introduced some new challenges. One of them is that some developers do not remove dependency declarations from the declaration file, resulting in unnecessary libraries being packaged in the built binary. This can account for up to 40% of libraries [1].

In this project we study the relationship between these unnecessary dependencies and the presence of conflicts in third-party library versions. If a conflict is due to an unnecessary dependency, we call it a superfluous conflict. The goal here is to quantify superfluous conflicts over the whole Maven Central repository.

  • [1] N. Harrand, A. Benelallam, C. Soto-Valero, O. Barais, B. Baudry. Analyzing 2.3 Million Maven Dependencies to Reveal an Essential Core in APIs. arXiv.
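
The definition of a superfluous conflict can be made concrete with a small sketch. It assumes a simplified model in which each direct dependency pulls in one version of a library, and in which the set of unnecessary dependencies is already known; all library names are invented.

```python
from collections import defaultdict


def conflicts(requirements):
    """Return the libraries that are requested in more than one version.

    `requirements` is a list of (direct_dependency, library, version) triples.
    """
    versions = defaultdict(set)
    for _, library, version in requirements:
        versions[library].add(version)
    return {lib: vs for lib, vs in versions.items() if len(vs) > 1}


def superfluous_conflicts(requirements, unnecessary):
    """Conflicts that vanish once unnecessary direct dependencies are removed."""
    needed = [r for r in requirements if r[0] not in unnecessary]
    return set(conflicts(requirements)) - set(conflicts(needed))


requirements = [
    ("web-framework", "json-lib", "2.1"),
    ("legacy-client", "json-lib", "1.4"),  # legacy-client is declared but never used
    ("web-framework", "log-lib", "1.0"),
    ("metrics", "log-lib", "2.0"),
]
print(superfluous_conflicts(requirements, unnecessary={"legacy-client"}))
# prints {'json-lib'}
```

Here the log-lib conflict remains because both of its sources are needed, while the json-lib conflict is superfluous. A study on Maven would additionally have to account for transitive dependencies and version mediation.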

Visually exploring data structure choices

Software developers take many arbitrary decisions when implementing software applications. One of them is the choice of data structures. The student will work on a system that can automatically vary data structure choices in programs and visualize the impact of this change. Impact can be assessed with respect to the size of the final jar produced, the diversity of execution traces, and the transitive dependency graph.

Software testing

NumPy: test consolidation with Descartes

NumPy is a fundamental package for scientific computing with Python, as well as an excellent illustration of state of the art software engineering [1]. For example, the NumPy community uses four different continuous integration systems [2]. Its crucial importance for science calls for a rock-solid test suite, in order to ensure the validity and reproducibility of scientific experiments.
You will dive deep into the test suite of NumPy and aim at making it stronger through a systematic assessment of the test cases. You will investigate the presence of pseudo-tested methods [3] and contribute test improvements to NumPy’s test suite.

Automatic synthesis of Mock objects based on Production observations

Mock objects are highly valuable to create predictable test environments, which speed up test execution and limit flaky tests. Yet, the development of relevant mock objects is challenging, since there is currently no support to determine the validity or value of manually selected values for mocks.
You will design a system that observes an application in production in order to collect real program state values that will then be turned into mock objects. This system will leverage efficient observability technology [3] in order to contribute to the state of the art of automated test generation [1,2].
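
A minimal sketch of the observe-then-mock idea is given below. It assumes that observation happens by wrapping a collaborator object, which is much simpler than production-grade observability; the Recorder class, the synthesize_mock function and the PriceService stand-in are all hypothetical names.

```python
from unittest.mock import Mock


class Recorder:
    """Wrap a collaborator and log (method, args, result) for every call."""

    def __init__(self, target):
        self._target = target
        self.observations = []

    def __getattr__(self, name):
        method = getattr(self._target, name)

        def observed(*args):
            result = method(*args)
            self.observations.append((name, args, result))
            return result

        return observed


def synthesize_mock(observations):
    """Build a mock that replays the observed results for the observed arguments."""
    replay = {}
    for name, args, result in observations:
        replay.setdefault(name, {})[args] = result
    mock = Mock(spec=list(replay))
    for name, table in replay.items():
        getattr(mock, name).side_effect = lambda *args, _t=table: _t[args]
    return mock


class PriceService:  # stand-in for a real production collaborator
    def price(self, item):
        return {"apple": 3, "pear": 5}[item]


recorder = Recorder(PriceService())
recorder.price("apple")  # calls observed "in production"
recorder.price("pear")

mock = synthesize_mock(recorder.observations)
```

The resulting mock answers only for argument values that were actually observed, so a test using it exercises the code with realistic data rather than with manually chosen values.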

Full coverage: analyzing the effectiveness of diverse coverage tools

Code coverage is a key metric to assess test suite quality as well as to perform dynamic analyses [1]. Yet, there exists a variety of test coverage tools, each with its own strengths and quirks [1,2].
You will design and perform a systematic analysis of the main coverage tools for a specific programming language, e.g. Java [2,3,4], in order to determine which is the most appropriate combination of tools for the most accurate measurement of full coverage.

Amplifying library test suites with client usages

Third-party libraries are at the core of the software supply chain [1]. Their test suites are essential to ensure the quality of this infrastructure.
One solution to consolidate these test suites consists in carving additional test cases by running the clients of these libraries [2].
You will design, implement and evaluate a test carving tool for Java libraries [3].

Modular: Analyzing test suites in multi-module Maven projects

Medium to large sized Maven projects tend to leverage a multi-module structure. This allows developers to keep a modular architecture and favors good development practices in collaborative efforts. In this type of project, it is usual that the test suite of one module also specifies code from other modules. Yet, most practical tools to assess the quality of test suites, specifically mutation testing tools, assume that all unit test cases are included in the same module, an assumption that misses the aforementioned practices. This project aims at providing empirical evidence about the interactions between multi-module projects, their test suite and the capacity of test assessment tools.

A journey in modern testing suites on Github

Automatic testing is boosted by a wide adoption of DevOps, continuous integration that runs test suites on every commit and good support for feedback to developers, for example through the Github UI. This has triggered the development of a large number of techniques and tools to support developers in writing sophisticated, efficient test suites. These techniques span modern frameworks to specify test assertions, such as AssertJ, advanced test concepts such as property-based testing, for example with QuickTheories in Java, or the development of novel virtualization technology to support testing in multiple environments, e.g., testcontainers. In this work, we explore how these modern techniques and tools for software testing are actually used in open source projects. We aim at finding extraordinary examples of test suites, determining trends in the adoption of certain techniques and letting software testers from industry and academia have a journey in modern test suites.

Off the beaten track

Github repositories with literary references

Github repositories are rich sources of code, documentation and discussions. They also contain amazing resources such as images, sound snippets, texts or references. A recent study has analyzed the presence of links to academic papers in Github repositories [1]. This study reveals the critical importance of linking code, data and publications to improve replication in computational science. In this work we wish to explore literary references in Github. For example, references to Bob Dylan cited in C code or novel quotes in comments, perl -le'$_=`perldoc -T perlfaq4`,s/^.*N;(.*?)E.*$/$1/s,print'.

The study seeks to unveil the deep connection of Github with culture and society and to analyze the role of literature on software development.

Anatomy of Outlook mail

Every day we use extraordinary software objects. Examples of such objects include the Android mobile systems that run on billions of devices, the domain name system that runs the web, or the Outlook email client that lets millions of workers communicate efficiently. These objects are extraordinary in several respects: they are large, they are composed of hundreds of diverse software parts, they evolve fast, they exist in many versions that are tailored to various needs. The massive presence of such objects, as well as the very large dimensions that characterize them, are intriguing for software developers and for users. One approach to unveil the extraordinary nature of these objects consists in breaking each of them down into all of its components, in the manner of an anatomical analysis [1,2].

In this work, we aim at building a fine-grained anatomy of an extraordinary, extremely popular software object: the Outlook email client.

Easter egg VM flag

Easter eggs, sometimes called the final frontier of software development [10]. (Except that of course you can’t have a final frontier, because there’d be nothing for it to be a frontier to, but as frontiers go, it’s pretty penultimate . . .) [269696]. And against the wash of continuous integration a commit hangs, bloated and poetic, one single, cool contribution, gleaming like the madness of gods. Nearly unreal. Reality is not digital, an on-off state, but analog. Easter eggs are for lovers and for the mind. Not enterprise, nor a resurrection, they cherish enchantment and freedom. In the quest for technology and Mastery, you will add an extra mile to the frontier with a new Easter flag for an extraordinary virtual machine [42].

[42] java -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal -version
[10] Curated list of all the easter eggs and hidden jokes in Python.
[269696] Moving Pictures. T. Pratchett. 1990, on Monday afternoon, just before tea.

Remix topics

A remix in music and art appropriates and changes other materials to create something new, often as a form of tribute to the original material. Through the following research topics we want to explore remix in software research: take a concept or an experiment explored in one context (a specific language, application domain or period in time) and explore it again in another context.

Remix Descartes in Javascript

Pseudo-tested methods [1] are dangerous methods: these methods are reached by at least one test case in the test suite, code coverage is good for these methods and still, if the whole body of the method is removed, no test case notices it. In other words, there exists no test input or test oracle that can trigger or assess a behavior such that the removal of a complete method body is detected. Descartes [2] is an open source tool that can automatically detect pseudo-tested methods in Java projects. The objective of this work is to build a similar tool for Javascript as an extension of Stryker [3].
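
The definition can be illustrated independently of Java or Javascript. The sketch below is written in Python for brevity, uses an invented Account class, and only covers the simplest case of emptying a method body; it is not how Descartes or Stryker are implemented.

```python
class Account:
    def __init__(self):
        self.balance = 0
        self.history = []

    def deposit(self, amount):
        self.balance += amount
        self.log(amount)

    def log(self, amount):
        self.history.append(amount)


def test_deposit():
    account = Account()
    account.deposit(10)            # reaches log(), so log() is covered
    assert account.balance == 10   # but nothing checks account.history


def is_pseudo_tested(cls, method_name, tests):
    """Empty the method body, rerun the tests, report whether they all still pass."""
    original = getattr(cls, method_name)
    setattr(cls, method_name, lambda self, *args, **kwargs: None)
    try:
        for test in tests:
            test()
        return True
    except Exception:
        return False
    finally:
        setattr(cls, method_name, original)  # always restore the real method


print(is_pseudo_tested(Account, "log", [test_deposit]))      # True
print(is_pseudo_tested(Account, "deposit", [test_deposit]))  # False
```

The log method is covered yet pseudo-tested, because the only test that reaches it never inspects the history; adding an assertion on account.history would fix it.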

Web stalker remix: deconstructing modern browser technology

In 1998, Simon Pope, Colin Green and Matthew Fuller designed the Web Stalker, an alternative web browser that displays the structure of web pages instead of their content [1]. The work was driven by a strong desire to understand what happens beyond the screen and to let web users experience this understanding. Twenty years later, the adoption of the web has massively radiated in all aspects of our lives and the complexity of web browsers has exploded.
This project is about rethinking a web stalker in the era of modern web browsers, going from the design of a solution that leverages the architecture of these browsers [2] to the implementation of an artistic representation of web page content based on Electron [3].

Remix neural decompilation for JVM bytecode

A decompiler takes compiled code (e.g. x86 code) and produces source code. Decompilation is an essential step for program comprehension, security analyses, etc. However, it is challenging to write an accurate decompiler (one that can retrieve the source code that actually corresponds to the compiled code) and the implementation of decompilers currently relies on the careful, manual design of decompilation rules. Some recent works [1,2] have proposed to use machine learning in order to train a decompiler. These works successfully applied this concept to decompile from binary to C source code. In this work, we wish to remix the exciting concept of decompiler learning for Java bytecode.