Research topics for internships, Master's projects or theses

Software hardening

Full-stack debloating for a video conferencing system
Reproducible builds for Maven
Automatic specialization of the Java Runtime (JRE)
Systematic decompilation in the CI to mitigate supply chain attacks
Detecting superfluous conflicts in Java projects
Docker slimming in practice
API specialization in Kotlin
Live analysis of WebAssembly in the browser

Software testing

NumPy: boosting the test suite of the Python numerical analysis package
Automatic synthesis of Java Mock objects based on Production observations
Effectiveness of diverse coverage tools for Java
Amplifying Kotlin library test suites with client usages
From JSON to Java records

Software diversification

Diversifying a npm registry
Polymorphing GraphQL queries
Diverse Multi-compilation for Trusting trust
Java – Kotlin translation to diversify bytecode
Automatic generation of 1 Million libc
Automatic synthesis of diverse replacements for Java expressions
Superdiversifying SHA256
Automatic diversification of Kafka

Off the beaten track

Github repositories with literary references
The anatomy of the most Enterprise email client
Easter egg VM flag
Web stalker: deconstructing modern browser technology (remix)
Paint Splatters & Perl Programs (remix)

Software hardening

Full-stack debloating for a video conferencing system
Software bloat is data and code that accumulates over time and yet is not necessary for an application to behave correctly. Several techniques have been proposed in recent years to detect and remove bloat. These techniques complement each other since they analyze bloat at different levels of the software stack (libraries, containers, kernel, etc.). Yet, no previous work has studied the combined effect of these techniques.
For this thesis, you will apply different debloating techniques such as DepClean [1], docker-slim [2] and unikernels [3]. You will measure the effects of each technique and of their combination on the Jitsi video conferencing system.

Reproducible builds for Maven
Supervisors: Benoit Baudry, Martin Monperrus, KTH Royal Institute of Technology

Build reproducibility is an essential property for secure software supply chains [1]. There is an ongoing effort in some Linux distributions, in particular Debian, to ensure reproducible builds [2]. In the Java world, there is little work on this topic and no clear understanding of the problem. You will design, perform and analyze an experiment to assess the status quo of reproducible builds in Java, and you will build a tool to improve build reproducibility.
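
A build is reproducible when two independent builds of the same sources yield bit-for-bit identical artifacts. As a rough illustration only, the sketch below compares two JAR files (which are ZIP archives) and, when they differ, lists the entries responsible. The helper names `entry_digests` and `compare_builds` are hypothetical and not part of any existing tool.

```python
import hashlib
import io
import zipfile


def entry_digests(jar_bytes: bytes) -> dict:
    """Map each entry name in a JAR (a ZIP archive) to the SHA-256 of its content."""
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        return {
            name: hashlib.sha256(jar.read(name)).hexdigest()
            for name in jar.namelist()
        }


def compare_builds(jar_a: bytes, jar_b: bytes) -> dict:
    """Compare two builds of the same sources (hypothetical helper)."""
    a, b = entry_digests(jar_a), entry_digests(jar_b)
    return {
        # Strictest criterion: the two archives are bit-for-bit identical.
        "identical": hashlib.sha256(jar_a).digest() == hashlib.sha256(jar_b).digest(),
        # Entries present in both archives whose content differs.
        "changed": sorted(n for n in a.keys() & b.keys() if a[n] != b[n]),
        # Entries present in only one of the two archives.
        "only_in_one": sorted(a.keys() ^ b.keys()),
    }
```

Two archives can also differ while every entry has the same content, for instance when only the timestamps or the order of entries change; in that case `identical` is false while `changed` and `only_in_one` are both empty.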

Automatic specialization of the Java Runtime (JRE)

The Java Runtime Environment (JRE) is a great, general purpose execution engine, which provides the standard Java libraries.
Because it is general purpose, it offers far more functionality than any single Java application running in the JRE needs.
You will design and experiment with a system that automatically specializes the JRE for a specific Java application, using jcov [1] to identify the parts that are necessary and the parts that can be removed. This topic contributes to hardening the software supply chain through debloating [2] and specialization of the software stack [3].

Systematic decompilation in the CI to mitigate supply chain attacks
Supervisors: Benoit Baudry, Martin Monperrus, KTH Royal Institute of Technology

Supply chain attacks [1] represent a growing threat to software systems, as illustrated by the SolarWinds attack in late 2020 [2].
One of these attacks consists in tampering with the code at one point in the automatic build pipeline, in order to inject malicious code into the binary.
In this work, we investigate the systematic disassembly of binaries [3], at the end of the build pipeline, to detect the injection of malicious code.

Detecting superfluous conflicts in Java projects
Build systems and package managers encourage and support reuse of third-party libraries through a clean declaration of third-party dependencies and an automated build process. Such systems include npm, Maven and Cargo. While reuse and automation are strong software engineering principles, the current use of build systems has introduced some new challenges. One of them is that some developers do not remove unused dependency declarations from the declaration file, resulting in unnecessary libraries being packaged in the built binary. This can account for up to 40% of libraries [1,2].

In this project, we study the relationship between these unnecessary dependencies and the presence of conflicts in third-party library versions. If a conflict is due to an unnecessary dependency, we call it a superfluous conflict. The goal here is to quantify superfluous conflicts over the whole Maven Central repository.
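
To make the notion concrete, here is a minimal sketch under one possible reading of the definition: a library is in conflict when it is requested in more than one version, and the conflict is superfluous when it disappears once the unnecessary dependencies are removed. The data model (triples of library, version and introducing dependency) and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def find_conflicts(resolved):
    """Group requested versions by library.

    `resolved` is a list of (library, version, introduced_by) triples, where
    `introduced_by` is the declared dependency that pulls the library in.
    Returns only the libraries requested in more than one version.
    """
    versions = defaultdict(lambda: defaultdict(set))
    for library, version, introduced_by in resolved:
        versions[library][version].add(introduced_by)
    return {lib: dict(v) for lib, v in versions.items() if len(v) > 1}


def superfluous_conflicts(resolved, unnecessary):
    """Return the conflicting libraries whose conflict disappears once the
    dependencies in the set `unnecessary` are removed (assumed definition)."""
    result = []
    for library, by_version in find_conflicts(resolved).items():
        remaining = [
            version
            for version, introducers in by_version.items()
            if introducers - unnecessary  # still requested by a needed dependency
        ]
        if len(remaining) <= 1:
            result.append(library)
    return sorted(result)
```

In a real study, the triples would come from the resolved dependency tree of each project and the set of unnecessary dependencies from a bloat detection tool.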

Docker slimming in practice
Supervisors: Benoit Baudry, Cesar Soto-Valero, KTH Royal Institute of Technology

Software debloating consists in removing code that has accumulated over time and that is not necessary for an application anymore [3].
Software bloat emerges in all layers of the software stack, from source code to third-party libraries, build files and containers.
In this project, we focus on docker-slim [1], a tool that analyzes applications to determine what they need, in order to secure and optimize containers.
The student will investigate to what extent docker-slim is used, what the different use cases are, and what the actual benefits are.

API specialization in Kotlin
Supervisors: Benoit Baudry, Cesar Soto-Valero, KTH Royal Institute of Technology

Software applications rely on numerous third-party APIs to reuse existing features (e.g., data processing, security, network, etc.).
Yet, applications use only a small part of the APIs.
The unused parts represent unnecessary risks for the security and reliability of the application.

In this project, we investigate API specialization to mitigate these risks [1].
This technique first determines the legitimate usages of an API, to build a sense of self [3] for the application's API usage.
Then, the specialization consists in building a proxy that blocks all other API usages at runtime.
This project focuses on specialization for Kotlin APIs [2].
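
The two phases can be pictured with a small, language-agnostic sketch (written in Python here, while the project itself targets Kotlin). The class and function names are hypothetical, and deriving the allowlist from a plain list of observed call names is a simplifying assumption.

```python
class ApiUsageBlocked(Exception):
    """Raised when the application calls an API member outside its profile."""


class SpecializedApi:
    """Proxy that only forwards calls to members in an allowlist."""

    def __init__(self, target, allowed):
        self._target = target
        self._allowed = frozenset(allowed)

    def __getattr__(self, name):
        # __getattr__ is only invoked for names not found on the proxy itself,
        # so every access to the wrapped API goes through this check.
        if name not in self._allowed:
            raise ApiUsageBlocked(f"{name} is not part of the observed usage")
        return getattr(self._target, name)


def observe_usage(calls):
    """First phase (assumed): derive the allowlist from observed call names."""
    return set(calls)
```

An application that was only ever observed calling `read` would then get `SpecializedApi(api, observe_usage(["read"]))`, and any later call to another member raises `ApiUsageBlocked`.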

Live analysis of WebAssembly in the browser
Supervisors: Benoit Baudry, Javier Cabrera-Arteaga, KTH Royal Institute of Technology

WebAssembly is rapidly conquering the world of web technology [1].
Its safe and compact binary format provides great support to consolidate existing applications and to boost the migration of legacy apps to the browser [3].

In this project, we will investigate which WebAssembly binaries arrive in web browsers.
The project includes the development of efficient technology to collect wasm files live in the browser.
The second part consists in analyzing the live coverage of these files, as well as their purpose.
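
Collecting wasm files first requires recognizing them among everything a browser downloads. One simple signal is the header of the binary format: a module starts with the four bytes `\0asm`, followed by a 4-byte little-endian version number. The sketch below only illustrates that check; the function name is hypothetical, and the actual collection would run inside the browser.

```python
WASM_MAGIC = b"\x00asm"  # first four bytes of every WebAssembly binary module


def wasm_version(payload: bytes):
    """Return the binary format version if `payload` starts like a wasm module,
    otherwise None."""
    if len(payload) < 8 or payload[:4] != WASM_MAGIC:
        return None
    # The version field is a 4-byte little-endian unsigned integer.
    return int.from_bytes(payload[4:8], "little")
```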

Software testing

NumPy: boosting the test suite of the Python numerical analysis package

NumPy is a fundamental package for scientific computing with Python, as well as an excellent illustration of state-of-the-art software engineering [1]. For example, the NumPy community uses four different continuous integration systems [2]. Its crucial importance for science calls for a rock-solid test suite, in order to ensure the validity and reproducibility of scientific experiments.
You will dive deep into the test suite of NumPy and aim at making it stronger through a systematic assessment of the test cases. You will investigate the presence of pseudo-tested methods [3] and contribute test improvements to NumPy’s test suite.
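
A method is pseudo-tested when the tests execute it, yet none of them fails if its body is replaced by one that does nothing or returns a fixed value. The sketch below shows the idea on a toy scale; `is_pseudo_tested` and its default stub values are assumptions for illustration, and the tests must call the function through the module so that the replacement is visible to them.

```python
def is_pseudo_tested(module, function_name, tests, stub_values=(None, 0, "")):
    """Return True if every test still passes for every stubbed-out body."""
    original = getattr(module, function_name)
    try:
        for value in stub_values:
            # Replace the body with one that ignores its inputs.
            setattr(module, function_name, lambda *a, _v=value, **k: _v)
            for test in tests:
                try:
                    test()
                except Exception:
                    return False  # a test noticed the change
        return True
    finally:
        # Always restore the original function.
        setattr(module, function_name, original)
```

A function reported as pseudo-tested points to tests that exercise the code without checking its effects, which is where additional assertions are most likely to pay off.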

Automatic synthesis of Java Mock objects based on Production observations

Mock objects are highly valuable to create predictable test environments, which speed up test execution and limit flaky tests. Yet, the development of relevant mock objects is challenging, since there is currently no support to determine the validity or value of manually selected values for mocks.
You will design a system that observes an application in production in order to collect real program state values that will then be turned into mock objects. This system will leverage efficient observability technology [3] in order to contribute to the state of the art of automated test generation [1,2].
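
The following sketch illustrates the pipeline on a toy scale (in Python, whereas the project targets Java): calls to a collaborator are recorded, then replayed by a mock. The helper names are hypothetical, wrapping a method in-process is only a stand-in for real production observation, and arguments are assumed to be hashable.

```python
from unittest.mock import Mock


def record_calls(target, method_name, log):
    """Wrap `target.method_name` so that each call's arguments and result
    are appended to `log` (stand-in for production observation)."""
    original = getattr(target, method_name)

    def wrapper(*args):
        result = original(*args)
        log.append((args, result))
        return result

    setattr(target, method_name, wrapper)


def mock_from_log(method_name, log):
    """Build a mock whose method replays the observed results."""
    observed = {args: result for args, result in log}
    mock = Mock()
    getattr(mock, method_name).side_effect = lambda *args: observed[args]
    return mock
```

The mock returned by `mock_from_log` answers only for arguments that were actually observed, so the values used in tests are real ones rather than values picked by hand.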

Effectiveness of diverse coverage tools for Java

Code coverage is a key metric to assess test suite quality as well as to perform dynamic analyses [1]. Yet, there exists a variety of test coverage tools, each with its strengths and quirks [1,2].
You will design and perform a systematic analysis of the main coverage tools for a specific programming language, e.g. Java [2,3,4], in order to determine which combination of tools gives the most accurate measurement of full coverage.
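
A first step in such an analysis could be to measure where tools agree and where they differ. The sketch below assumes that each tool's report has already been normalized to a set of (file, line) pairs; the function name and that normalized format are assumptions, not the output of any particular tool.

```python
def compare_coverage(reports):
    """`reports` maps a tool name to the set of (file, line) pairs
    that the tool marks as covered."""
    if not reports:
        raise ValueError("at least one coverage report is required")
    all_lines = set().union(*reports.values())
    agreed = set.intersection(*reports.values())
    return {
        "agreed": agreed,                 # covered according to every tool
        "disputed": all_lines - agreed,   # covered according to some tools only
        "agreement_ratio": len(agreed) / len(all_lines) if all_lines else 1.0,
    }
```

The disputed lines are the interesting ones: each of them is either a false positive of one tool or a false negative of another, and needs manual inspection.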

From JSON to Java records
Pankti records program states in production in order to generate differential unit tests that can improve the original test suite of an application [1]. Currently, the states are serialized in JSON, and the generated tests include instructions to deserialize the objects. In this thesis, you will investigate how to generate Java records [2] as part of the test harness. This will make test cases more readable, since they will no longer be overloaded with deserialization instructions. Java records were introduced as a preview feature in Java 14 and finalized in Java 16; they aim to simplify the way we create POJOs (Plain Old Java Objects).

Amplifying Kotlin library test suites with client usages

Third-party libraries are at the core of the software supply chain [1]. Their test suites are essential to ensure the quality of this infrastructure.
One solution to consolidate these test suites consists in carving additional test cases by running the clients of these libraries [2].
You will design, implement and evaluate a test carving tool for Java libraries [3].

Software diversification

Diversifying a npm registry

Supervisors: Benoit Baudry, Martin Monperrus, KTH Royal Institute of Technology

Dependency confusion is a growing threat for the software supply chain [1]. This attack consists in uploading malicious packages to public repositories, which will eventually be packaged in applications through dependency resolution mechanisms. In this work, we will explore the automatic randomization of instructions [3] in private npm registries to mitigate dependency confusion [2]. The student will deploy a local npm registry and an instruction randomization scheme, along with the adaptation of the JavaScript engine to correctly execute the randomized packages.

Polymorphing GraphQL queries

Supervisors: Benoit Baudry, Martin Monperrus, KTH Royal Institute of Technology

GraphQL is increasingly adopted for web APIs [1], making it a good target for exploits [2]. In this work, we investigate polymorphing to harden GraphQL APIs [3]. The student will develop a randomization scheme for the API and the corresponding adaptation of the client queries in order to build an effective protection against injection attacks.

Diverse Multi-compilation for Trusting trust

Supervisors: Benoit Baudry, Martin Monperrus, KTH Royal Institute of Technology

The problem of deceptive compilers introducing malicious code is relevant and hard [1,2]. One solution is to use multiple diverse compilers to mitigate the problem [3]. For instance, one can compile a C program with both GCC and Clang. You will design, implement and evaluate a multi-compiler scheme for C.

Automatic generation of 1 Million libc

libc is at the core of most software stacks, but it is fragile and prone to critical vulnerabilities [1]. In this work, we explore a combination of techniques to generate large amounts of diverse implementations of libc [2]. The student will combine the abundant combinations of flags of C compilers [3] with state-of-the-art code transformation and obfuscation techniques [4] to generate many libc variants.
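
To give a feeling for how quickly flag combinations grow, here is a small sketch that enumerates them. The flags listed are ordinary GCC options chosen as examples; the selection and the function name are assumptions for illustration, not a recommended configuration.

```python
from itertools import product

OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3", "-Os"]
TOGGLES = ["-fomit-frame-pointer", "-funroll-loops", "-fno-inline"]


def flag_combinations(opt_levels=OPT_LEVELS, toggles=TOGGLES):
    """Yield one flag list per variant: one optimization level
    plus any subset of the on/off toggles."""
    for level in opt_levels:
        for mask in product([False, True], repeat=len(toggles)):
            yield [level] + [flag for flag, on in zip(toggles, mask) if on]
```

With k optimization levels and n independent toggles, this yields k × 2^n flag sets, so 20 toggles already give more than one million combinations per level. Distinct flag sets do not guarantee distinct binaries, which is why the resulting variants still have to be compared.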

Java – Kotlin translation to diversify bytecode

Supervisors: Benoit Baudry, Martin Monperrus, KTH Royal Institute of Technology

The transition from Java to Kotlin is a timely and hard problem [1].
In this work, we explore the natural diversity of translation strategies from Java to Kotlin [2], as well as the diversity of compilation options of kotlinc [3] and javac [4]. The goal is to assess the ability of these strategies to generate diverse versions of Java bytecode for the same piece of source code.

Superdiversifying SHA256

Software diversity increases the robustness of software systems [1]. Through various transformations and randomization, it is possible to automatically generate variants of a program. These variants should have minimal impact on convenience, usability, and efficiency. Meanwhile, the variants should not all be susceptible to the same bug or vulnerability.
In this project, we explore the large-scale diversification of SHA256 [2]. This family of hashing functions is essential for cryptography, and hence a critical feature for security. The student will investigate superdiversification [3] and the composition of multiple diversification techniques, in order to synthesize large amounts of variants for an implementation of SHA256.
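
Whatever the transformations, every variant must still compute the same digests as the original. A minimal way to check this is differential testing against a reference implementation, sketched below with Python's `hashlib` as the reference. The function names are hypothetical, and `chunked_sha256` is only a deliberately trivial stand-in for a real variant.

```python
import hashlib
import random


def agrees_with_reference(variant, trials=200, seed=0):
    """Check that `variant(data) -> hex digest` matches hashlib.sha256
    on pseudo-random inputs of up to 255 bytes."""
    rng = random.Random(seed)
    for _ in range(trials):
        length = rng.randrange(0, 256)
        data = bytes(rng.getrandbits(8) for _ in range(length))
        if variant(data) != hashlib.sha256(data).hexdigest():
            return False
    return True


def chunked_sha256(data, chunk_size=7):
    """A trivially different variant: feeds the input in small chunks."""
    h = hashlib.sha256()
    for i in range(0, len(data), chunk_size):
        h.update(data[i:i + chunk_size])
    return h.hexdigest()
```

Passing such a check on sampled inputs does not prove equivalence; it only filters out variants that are clearly broken before more expensive analyses.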

Automatic synthesis of diverse replacements for Java expressions

State-of-the-art program synthesis techniques support the automatic generation of code snippets based on input / output examples or template expressions. In this work, we wish to experiment with these techniques in order to replace existing code snippets written by developers with synthetic ones. The objective is to generate program variants that are semantically similar but whose executions are different.
The approach combines code analysis and code synthesis to generate diverse program variants.

Automatic diversification of Kafka

Automatic software diversity consists in generating multiple variants of an application, which provide the same functionality, with diverse implementations.
The goal is to minimize the risks of having a single point of failure.
In this project, we aim at automatically synthesizing diverse variants of applications that stream data with Kafka [1]. Diversification will target Kafka itself, e.g., by building the application with different versions of Kafka. We will also leverage the natural emergence of the Kafka-compatible streaming library Redpanda [2].

  • [1] Kafka
  • [2] Redpanda
  • [3] The multiple facets of software diversity: Recent developments in year 2000 and beyond

Off the beaten track

Github repositories with literary references

GitHub repositories are rich sources of code, documentation and discussions. They also contain amazing resources such as images, sound snippets, texts or references. A recent study has analyzed the presence of links to academic papers in GitHub repositories [1]. This study reveals the critical importance of linking code, data and publications to improve replication in computational science. In this work, we wish to explore literary references in GitHub. For example, references to Bob Dylan cited in C code or novel quotes in comments, perl -le'$_=`perldoc -T perlfaq4`,s/^.*N;(.*?)E.*$/$1/s,print'.

The study seeks to unveil the deep connection of GitHub with culture and society and to analyze the role of literature in software development.

Anatomy of Outlook mail

Every day we use extraordinary software objects. Examples of such objects include the Android mobile systems that run on billions of devices, the domain name system that runs the web, or the Outlook email client that lets millions of workers communicate efficiently. These objects are extraordinary in several respects: they are large, they are composed of hundreds of diverse software parts, they evolve fast, and they exist in many versions that are tailored to various needs. The massive presence of such objects, as well as the very large dimensions that characterize them, are intriguing for software developers and for users. One approach to unveil the extraordinary nature of such an object consists in breaking it down into all of its components, which amounts to an anatomical analysis of the object [1,2].

In this work, we aim at building a fine-grained anatomy of an extraordinary, extremely popular software object: the Outlook email client.

Easter egg VM flag

Easter eggs, sometimes called the final frontier of software development [10]. (Except that of course you can’t have a final frontier, because there’d be nothing for it to be a frontier to, but as frontiers go, it’s pretty penultimate . . .) [269696]. And against the wash of continuous integration a commit hangs, bloated and poetic, one single, cool contribution, gleaming like the madness of gods. Nearly unreal. Reality is not digital, an on-off state, but analog. Easter eggs are for lovers and for the mind. Not enterprise, nor a resurrection, they cherish enchantment and freedom. In the quest for technology and Mastery, you will add an extra mile to the frontier with a new Easter flag for an extraordinary virtual machine [42].

[42] java -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal -version
[10] Curated list of all the easter eggs and hidden jokes in Python.
[269696] Moving Pictures. T. Pratchett. 1990, on Monday afternoon, just before tea.

Web stalker: deconstructing modern browser technology (remix)

In 1998, Simon Pope, Colin Green and Matthew Fuller designed the Web Stalker, an alternative web browser that displays the structure of web pages instead of their content [1]. The work was driven by a strong desire to understand what happens beyond the screen and to let web users experience this understanding. Twenty years later, the web has spread into all aspects of our lives and the complexity of web browsers has exploded.
This project is about rethinking a web stalker in the era of modern web browsers, going from the design of a solution that leverages the architecture of these browsers [2] to the implementation of an artistic representation of web page content based on Electron [3].

Paint Splatters & Perl Programs (remix)

In 2019, Colin McMillen and Tim Toady ran an experiment to answer one question: is it possible to smear paint on the wall without creating valid Perl? This is an essential question at the forefront of the art / computing frontier.
In this project, we will reproduce McMillen’s experiment [1], starting with the curated dataset provided by the authors [2]. We will then elaborate on the findings with original splatters and an exploration of Perl’s diverse ecosystem [3].