Research topics for interns, Masters or thesis

Software hardening


Embedding the software supply chain at runtime with Java classloaders
Ultra small code with GraalVM and debloating
Full-stack debloating for a video conferencing system
Reproducible builds for Maven
Leveraging the diversity of bundlers for debloating JavaScript applications
Automatic specialization of the Java Runtime (JRE)
Systematic decompilation in the CI to mitigate supply chain attacks
Detecting superfluous conflicts in Java projects
Docker slimming in practice
API specialization in Kotlin
The software supply chain of creative coding
The natural diversity of fake generation

Software testing

Test Generation for Ethereum Clients Using Production Data
Code Coverage in Production
Live analysis of Webassembly in the browser
NumPy: boosting the test suite of the Python numerical analysis package
Automatic synthesis of Java Mock objects based on Production observations
Effectiveness of diverse coverage tools for Java
Amplifying Kotlin library test suites with client usages
From JSON to Java records

Software diversification

Diverse build pipelines
Diverse execution environments with infrastructure as code
Github copilot for automatic diversity
Diversifying a npm registry
Polymorphing GraphQL queries
Diverse Multi-compilation for Trusting trust
Java – Kotlin translation to diversify bytecode
Automatic generation of 1 Million libc
Automatic synthesis of diverse replacements for Java expressions
Superdiversifying SHA256
Automatic diversification of Kafka

Off the beaten track

Code by singing in eso-lang
Github repositories with literary references
The anatomy of the most Enterprise email client
Easter egg VM flag
Web stalker: deconstructing modern browser technology (remix)
Paint Splatters & Perl Programs (remix)

Software hardening

Embedding the software supply chain at runtime with Java classloaders

In Java, class loading refers to retrieving the binary form of a class or interface and constructing, from that binary form, a class object to represent the class or interface [1]. Today, different subclasses of the `ClassLoader` may implement different loading policies [2]. For example, a class loader may cache the binary representation of a class, prefetch it based on expected usage, or load a group of related classes together. These activities may not be completely transparent to a running application. In this context, determining the third-party suppliers of classes loaded at runtime allows for controlling and hardening the software supply chain of third-party components used during program execution. Monitoring the origins of the “actually” executed code is a critical task for building more reliable and secure systems. The student will design and implement a novel software tool to build a representation of the software supply chain at runtime.

Ultra small code with GraalVM and debloating
GraalVM compiles Java code to native executables, boosting deployment and runtime performance. Meanwhile, code debloating [2] removes unnecessary code from applications, reducing code size and attack surface. Both techniques are actively researched in the Java ecosystem [2,3]. In this work, we will use both techniques in conjunction to take code reduction one step further. We will experiment with debloating before, as well as after, the GraalVM compilation to understand where the largest code size savings can be achieved. Quarkus [4] might be used to push the reduction one step further.

Full-stack debloating for a video conferencing system
Software bloat is data and code that accumulates over time and yet is not necessary for an application to behave correctly. Several techniques have been proposed over the last years to detect and remove bloat. These techniques complement each other since they analyze bloat at different levels of the software stack (libraries, containers, kernel, etc.). Yet, no previous work has studied the combined effect of these techniques.
For this thesis, you will apply different debloating techniques such as DepClean [1], docker-slim [2] and unikernels [3]. You will measure the effects of each technique and of their combination on the Jitsi video conferencing system.

Reproducible builds for Maven
Supervisors: Benoit Baudry, Martin Monperrus, KTH Royal Institute of Technology

Build reproducibility is an essential property of secure software supply chains [1]. There is an ongoing effort in some Linux distributions, in particular Debian, to ensure reproducible builds [2]. In the Java world, there is little work on this topic and no clear understanding of the problem. You will design, perform and analyze an experiment to assess the status quo of reproducible builds in Java, and develop a tool to improve build reproducibility.
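
As an illustration of the kind of check such an experiment could automate, the sketch below builds the same archive twice and compares SHA-256 digests. It is written in Python for brevity and assumes that entry timestamps inside zip-based archives (such as JARs) are one source of non-reproducibility. The helpers `build_archive` and `digest` and the file contents are hypothetical.

```python
import hashlib
import io
import zipfile

def build_archive(files, timestamp):
    """Pack `files` (name -> bytes) into an in-memory zip; every entry gets `timestamp`."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in sorted(files):  # fixed entry order
            info = zipfile.ZipInfo(name, date_time=timestamp)
            archive.writestr(info, files[name])
    return buffer.getvalue()

def digest(artifact):
    """Hex SHA-256 digest of the artifact bytes."""
    return hashlib.sha256(artifact).hexdigest()

# Hypothetical build inputs.
files = {
    "App.class": b"\xca\xfe\xba\xbe",
    "META-INF/MANIFEST.MF": b"Manifest-Version: 1.0\n",
}

# Two builds of the same inputs, five minutes apart.
first = build_archive(files, (2021, 3, 1, 10, 0, 0))
second = build_archive(files, (2021, 3, 1, 10, 5, 0))

# Two builds with the timestamp pinned to a constant.
pinned_a = build_archive(files, (1980, 1, 1, 0, 0, 0))
pinned_b = build_archive(files, (1980, 1, 1, 0, 0, 0))

print(digest(first) == digest(second))       # False
print(digest(pinned_a) == digest(pinned_b))  # True
```

Identical inputs yield different artifacts as soon as the build embeds the current time, and identical artifacts once that value is pinned.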

Leveraging the diversity of bundlers for debloating JavaScript applications

JavaScript is the most used programming language for the development of web applications. As a web application grows, so does its bundle size, primarily due to all its third-party dependencies [1,2]. A bundler is a tool that transforms all the JavaScript code and its dependencies into a new output file with everything merged (including other files such as HTML, CSS, and PNG). There are many production-ready JavaScript bundlers (e.g., Webpack, Rollup, Browserify, esbuild, and Parcel). They can perform optimizations and minifications on the bundle, such as tree shaking, scope hoisting, bundle splitting, and minifying [4]. However, the size reduction achieved by a bundler is limited by its own code minimization technique [3]. The student will perform an experimental study to leverage the diversity of JavaScript bundlers in order to reduce the original code size of applications while keeping the functionality required to pass all test cases in their test suites.

  • [1] Slimming JavaScript Applications: An Approach for Removing Unused Functions From JavaScript libraries (JSS), 2019
  • [2] Evolving JavaScript Code to Reduce Load Time (TSE), 2021
  • [3] Stubbifier: Debloating Dynamic Server-Side JavaScript Applications (ArXiv), 2021
  • [4] https://webpack.js.org/guides/tree-shaking/

Automatic specialization of the Java Runtime (JRE)

The Java Runtime Environment (JRE) is a great, general purpose execution engine, which provides the standard Java libraries.
Because it is general purpose, it offers far more functionality than any single Java application running on it needs.
You will design and experiment with a system that automatically specializes the JRE for a specific Java application, using jcov [1] to identify the parts that are necessary and the parts that can be removed. This topic contributes to hardening the software supply chain through debloating [2] and specialization of the software stack [3].

Systematic decompilation in the CI to mitigate supply chain attacks
Supervisors: Benoit Baudry, Martin Monperrus, KTH Royal Institute of Technology

Supply chain attacks [1] represent a growing threat to software systems, as illustrated by the SolarWinds attack in late 2020 [2].
One such attack consists in tampering with the code at some point in the automatic build pipeline, in order to inject malicious code into the binary.
In this work, we investigate the systematic disassembly of binaries [3] at the end of the build pipeline, in order to detect the injection of malicious code.

Detecting superfluous conflicts in Java projects
Build systems and package managers encourage and support reuse of third-party libraries through a clean declaration of third-party dependencies and an automated build process. Such systems include npm, Maven and Cargo. While reuse and automation are strong software engineering principles, the current use of build systems has introduced some new challenges. One of them is that some developers do not remove obsolete dependency declarations from the declaration file, resulting in unnecessary libraries being packaged in the built binary. This can account for up to 40% of libraries [1,2].

In this project, we study the relationship between these unnecessary dependencies and the presence of conflicts in third-party library versions. If a conflict is due to an unnecessary dependency, we call it a superfluous conflict. The goal here is to quantify superfluous conflicts over the whole Maven Central repository.
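
The definition above can be sketched as follows, in Python for brevity. The sketch is a deliberate simplification: it assumes a conflict is "the same library requested in more than one version" and ignores how a real resolver mediates between versions. All library names, versions, and the functions `find_conflicts` and `superfluous_conflicts` are hypothetical.

```python
from collections import defaultdict

def find_conflicts(requirements):
    """Return {library: versions} for libraries requested in more than one version.

    `requirements` maps each direct dependency to {transitive library: version}.
    """
    versions = defaultdict(set)
    for wanted in requirements.values():
        for library, version in wanted.items():
            versions[library].add(version)
    return {lib: vs for lib, vs in versions.items() if len(vs) > 1}

def superfluous_conflicts(requirements, unnecessary):
    """Conflicts that disappear once the unnecessary direct dependencies are removed."""
    all_conflicts = find_conflicts(requirements)
    needed_only = {
        dep: wanted for dep, wanted in requirements.items() if dep not in unnecessary
    }
    remaining = find_conflicts(needed_only)
    return {lib for lib in all_conflicts if lib not in remaining}

# Hypothetical project: "report-tool" is declared but never used.
requirements = {
    "web-framework": {"json-lib": "2.9", "logging-lib": "1.7"},
    "report-tool": {"json-lib": "2.6"},
    "db-driver": {"logging-lib": "1.6"},
}

print(sorted(find_conflicts(requirements)))                          # ['json-lib', 'logging-lib']
print(sorted(superfluous_conflicts(requirements, {"report-tool"})))  # ['json-lib']
```

Here the `json-lib` conflict exists only because of the unused `report-tool` declaration, so it is superfluous, while the `logging-lib` conflict involves two dependencies that are actually needed.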

Docker slimming in practice
Supervisors: Benoit Baudry, Cesar Soto-Valero, KTH Royal Institute of Technology

Software debloating consists in removing code that has accumulated over time and that is not necessary for an application anymore [3].
Software bloat emerges in all layers of the software stack, from source code to third-party libraries, build files and containers.
In this project, we focus on docker-slim [1], a tool that analyzes applications to determine what they need, in order to secure and optimize containers.
The student will investigate to what extent docker-slim is used, what the different use cases are and what the actual benefits are.

API specialization in Kotlin
Supervisors: Benoit Baudry, Cesar Soto-Valero, KTH Royal Institute of Technology

Software applications rely on numerous third-party APIs to reuse existing features (e.g., data processing, security, network, etc.).
Yet, applications use only a small part of the APIs.
The unused parts represent unnecessary risks for the security and reliability of the application.

In this project, we investigate API specialization to mitigate these risks [1].
This technique first determines the legitimate usages of an API, in order to build a sense of self [3] for the application's API usage.
Then, the specialization consists in building a proxy that blocks all other API usages at runtime.
This project focuses on specialization for Kotlin APIs [2].

The natural diversity of fake generation

The automatic synthesis of fakes is a powerful technique for cybersecurity and cyber decoys [1]. The key challenge for this approach is to generate fakes that look like real, legitimate artefacts. This is essential so that malicious actors believe that they are collecting real documents. Yet, this is hard, as it requires an automated oracle to determine the ‘realism’ of the fake. In this project, we explore the natural diversity of software technology to synthesize different kinds of fake data and documents [2].

The software supply chain of creative coding

Artists use advanced software technology to produce, distribute and even generate artworks. Such software technology includes libraries for sound synthesis [1], visual art [2,3], augmented reality [4], as well as platforms to distribute artworks [5,6]. In this work, we dive deep into this software ecosystem to draw a systematic landscape of the software supply chain [7] for creative coding.

https://github.com/topics/fake

Software testing

Test Generation for Ethereum Clients Using Production Data
Supervisors: Martin Monperrus, Benoit Baudry

Description: Unit testing is one of the essential ways to improve the quality of software. It is also helpful for correctness checking when there are different implementations based on the same software specification. Take Ethereum clients as an example: thousands of common tests [1] are provided for all the Ethereum client projects. Though these tests already cover various cases, there are corner cases in production that are missing in the test suite [2]. In this thesis project, you will design, implement and evaluate a prototype that collects production data and generates new valuable test cases for Ethereum clients.

Code Coverage in Production
Supervisors: Martin Monperrus, Benoit Baudry

Description: Code coverage usually relates to test code. Production code coverage is the coverage over real interactions made by users in production. Obtaining and analysing production code coverage makes it possible to identify useless code as well as relevant test data and values. It enables testers and developers to better align the test intentions with what matters for users. The student will compare and analyze techniques for automatically collecting code coverage in production for Java software.
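
To make the idea concrete, the sketch below collects coverage while replaying a few "production" requests and reports which functions were executed. It is a toy illustration in Python, using the standard `sys.settrace` hook, not one of the Java techniques the thesis will compare. The handler, the exporters and the request format are hypothetical.

```python
import sys

def collect_coverage(handler, requests):
    """Run `handler` on each request; return the (function name, line number) pairs executed."""
    executed = set()

    def tracer(frame, event, arg):
        if event == "line":
            executed.add((frame.f_code.co_name, frame.f_lineno))
        return tracer  # keep tracing inside this frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        for request in requests:
            handler(request)
    finally:
        sys.settrace(previous)  # always restore the previous tracer
    return executed

def export_csv(data):
    return ",".join(data)

def export_xml(data):
    return "".join(f"<v>{d}</v>" for d in data)

def handle(request):
    kind, data = request
    if kind == "csv":
        return export_csv(data)
    return export_xml(data)

# Hypothetical production traffic: users only ever ask for CSV.
production_requests = [("csv", ["a", "b"]), ("csv", ["c"])]
covered = {name for name, _ in collect_coverage(handle, production_requests)}
print(sorted(covered))  # ['export_csv', 'handle']
```

In this run `export_xml` is never reached, which makes it a candidate for the "useless code" mentioned above, at least for the observed traffic.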

Live analysis of Webassembly in the browser
Supervisors: Benoit Baudry, Javier Cabrera-Arteaga, KTH Royal Institute of Technology

WebAssembly is rapidly conquering the world of web technology [1].
Its safe and compact binary format provides great support to consolidate existing applications and to boost the migration of legacy apps to the browser [3].

In this project, we will investigate which WebAssembly binaries arrive in web browsers.
The first part of the project is the development of efficient technology to collect wasm files live in the browser.
The second part consists in analyzing the live coverage of these files, as well as their purpose.

NumPy: boosting the test suite of the Python numerical analysis package

NumPy is a fundamental package for scientific computing with Python, as well as an excellent illustration of state-of-the-art software engineering [1]. For example, the NumPy community uses four different continuous integration systems [2]. Its crucial importance for science calls for a rock-solid test suite, in order to ensure the validity and reproducibility of scientific experiments.
You will dive deep into the test suite of NumPy and aim at making it stronger through a systematic assessment of the test cases. You will investigate the presence of pseudo-tested methods [3] and contribute test improvements to NumPy’s test suite.
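
A pseudo-tested method is one that the test suite executes, but whose body can be removed without any test failing. The sketch below illustrates this check on two toy functions. It is a minimal illustration, not the tooling the thesis would use. The functions, the tiny test suite and the helper `pseudo_tested` are all hypothetical.

```python
def normalize(values):
    """Scale values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

def clamp(value, low, high):
    """Restrict value to the interval [low, high]."""
    return max(low, min(high, value))

def run_tests(functions):
    """Hypothetical test suite: returns True when no assertion or error occurs."""
    try:
        functions["normalize"]([1, 3])           # executed, result never checked
        assert functions["clamp"](5, 0, 3) == 3  # result is checked
        return True
    except Exception:
        return False

def pseudo_tested(functions, run_tests):
    """Names of functions whose body can be replaced by `return None` without failing a test."""
    assert run_tests(functions), "the original suite must pass"
    found = []
    for name in functions:
        mutated = dict(functions)
        mutated[name] = lambda *args, **kwargs: None  # body removed
        if run_tests(mutated):
            found.append(name)
    return found

functions = {"normalize": normalize, "clamp": clamp}
print(pseudo_tested(functions, run_tests))  # ['normalize']
```

The suite covers `normalize` but would not notice if it stopped working, which is exactly the kind of weakness a test improvement should address.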

Automatic synthesis of Java Mock objects based on Production observations

Mock objects are highly valuable to create predictable test environments, which speed up test execution and limit flaky tests. Yet, the development of relevant mock objects is challenging, since there is currently no support to determine the validity or value of manually selected values for mocks.
You will design a system that observes an application in production in order to collect real program state values that will then be turned into mock objects. This system will leverage efficient observability technology [3] in order to contribute to the state of the art of automated test generation [1,2].
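
The observe-then-mock idea can be sketched as follows. This is a toy illustration in Python with the standard `unittest.mock` module, whereas the thesis targets Java. It assumes that calls are deterministic and take hashable positional arguments. `Recorder`, `mock_from`, `PriceService` and `total` are hypothetical names.

```python
from unittest.mock import Mock

class Recorder:
    """Wraps a collaborator and logs (method, args, result) for each observed call."""
    def __init__(self, target):
        self._target = target
        self.observations = []

    def __getattr__(self, name):
        method = getattr(self._target, name)

        def wrapper(*args):
            result = method(*args)
            self.observations.append((name, args, result))
            return result

        return wrapper

def mock_from(observations):
    """Build a Mock that replays the recorded results for the recorded arguments."""
    replay = {}
    for name, args, result in observations:
        replay.setdefault(name, {})[args] = result
    mock = Mock(spec=list(replay))  # only the observed methods exist on the mock
    for name, table in replay.items():
        getattr(mock, name).side_effect = lambda *args, _table=table: _table[args]
    return mock

class PriceService:
    """Hypothetical collaborator that would be slow or remote in production."""
    def price(self, item):
        return {"apple": 3, "pear": 5}[item]

def total(service, items):
    return sum(service.price(item) for item in items)

# "Production" run: observe the real collaborator.
recorder = Recorder(PriceService())
total(recorder, ["apple", "pear"])

# Test run: replace the collaborator with a mock built from the observations.
mock = mock_from(recorder.observations)
print(total(mock, ["pear", "apple"]))  # 8
```

The mock only knows the values that were actually observed, so the test environment stays predictable while the data remains realistic.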

Effectiveness of diverse coverage tools for Java

Code coverage is a key metric to assess test suite quality as well as to perform dynamic analyses [1]. Yet, there is a variety of test coverage tools, each with its own strengths and quirks [1,2].
You will design and perform a systematic analysis of the main coverage tools for a specific programming language, e.g. Java [2,3,4], in order to determine which is the most appropriate combination of tools for the most accurate measurement of full coverage.

From JSON to Java records
Pankti records program states in production in order to generate differential unit tests that can improve the original test suite of an application [1]. Currently, the states are serialized in JSON, and the generated test includes instructions to deserialize the objects. In this thesis, you will investigate how to generate Java records [2] as part of the test harness. This will yield more readable test cases that are not overloaded with deserialization instructions. Java records were introduced as a preview feature in Java 14 and finalized in Java 16. They aim to simplify the way we create POJOs (Plain Old Java Objects).

Amplifying Kotlin library test suites with client usages

Third-party libraries are at the core of the software supply chain [1]. Their test suites are essential to ensure the quality of this infrastructure.
One solution to consolidate these test suites consists in carving additional test cases by running the clients of these libraries [2].
You will design, implement and evaluate a test carving tool for Kotlin libraries [3].

Software diversification

Diverse build pipelines

Some software projects with strong reliability and security constraints build their product with more than one build pipeline. This is also an approach to address the challenge of trusting trust [1]. For example, the NumPy open source project for scientific computing uses four continuous integration systems [2]. Following an attack against its Orion product, the SolarWinds company started using diverse build systems [3]. In this work, the student will experiment with integrating diversity in existing build pipelines. For example, the student will investigate duplicating a Travis CI pipeline with GitHub Actions and assess the impact of this diversity of build technology.

Diverse execution environments with infrastructure as code

Infrastructure as code is about provisioning execution resources through executable configuration files [1]. In this context, the execution of a program provisions a whole environment to execute an application. A variation of the same program will provision a different environment to run the same application. In this project, the student will explore transformations for infrastructure as code with the intention of creating a moving target at the environment level [2]. We consider using Modus to define the infrastructure [3].

Github copilot for automatic diversity

GitHub Copilot, a.k.a. an AI pair programmer, generates suggestions for lines of code, or entire functions [1]. It is based on an immense set of code written by human developers in order to synthesize new code in a new context. In this work, we wish to experiment with these techniques in order to replace existing code snippets written by developers with synthetic ones. The objective is to generate program variants that are semantically similar but whose executions differ.

Diversifying a npm registry

Supervisors: Benoit Baudry, Martin Monperrus, KTH Royal Institute of Technology

Dependency confusion is a growing threat to the software supply chain [1]. This attack consists in uploading malicious packages on public repositories, which will eventually be packaged in applications through dependency resolution mechanisms. In this work, we will explore the automatic randomization of instructions [3] in private npm registries to mitigate dependency confusion [2]. The student will deploy a local npm registry and an instruction randomization scheme, along with the adaptation of the JavaScript engine to correctly execute the randomized packages.

Polymorphing GraphQL queries

Supervisors: Benoit Baudry, Martin Monperrus, KTH Royal Institute of Technology

GraphQL is increasingly adopted for web APIs [1], making it a good target for exploits [2]. In this work, we investigate polymorphing to harden GraphQL APIs [3]. The student will develop a randomization scheme for the API and the corresponding adaptation of the client queries in order to build an effective protection against injection attacks.

Diverse Multi-compilation for Trusting trust

Supervisors: Benoit Baudry, Martin Monperrus, KTH Royal Institute of Technology

The problem of deceptive compilers introducing malicious code is relevant and hard [1,2]. One way to mitigate it is to use multiple diverse compilers [3]. For instance, one can compile a C program with both GCC and Clang. You will design, implement and evaluate a multi-compiler scheme for C.

Automatic generation of 1 Million libc

libc is at the core of most software stacks, but it is fragile and prone to critical vulnerabilities [1]. In this work, we explore a combination of techniques to generate large amounts of diverse implementations of libc [2]. The student will combine the abundant combinations of flags of C compilers [3] with state-of-the-art code transformation and obfuscation techniques [4] to generate many libc variants.

Java – Kotlin translation to diversify bytecode

Supervisors: Benoit Baudry, Martin Monperrus, KTH Royal Institute of Technology

The transition from Java to Kotlin is a timely and hard problem [1].
In this work, we explore the natural diversity of translation strategies from Java to Kotlin [2], as well as the diversity of compilation options of kotlinc [3] and javac [4]. The goal is to assess the ability of these strategies to generate diverse versions of Java bytecode for the same piece of source code.

Superdiversifying SHA256

Software diversity increases the robustness of software systems [1]. Through various transformations and randomization, it is possible to automatically generate variants of a program. These variants should have minimal impact on convenience, usability, and efficiency. Meanwhile, the variants should not all be susceptible to the same bug or vulnerability.
In this project, we explore the large-scale diversification of SHA256 [2]. This family of hashing functions is essential for cryptography, and hence a critical feature for security. The student will investigate superdiversification [3] and the composition of multiple diversification techniques, in order to synthesize large amounts of variants for an implementation of SHA256.
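
As a small illustration of what a single variant could look like, the sketch below implements SHA-256 in pure Python with pluggable `ch` and `maj` functions, then swaps in boolean-equivalent rewrites of both and checks the result against `hashlib`. This is only one hand-written transformation, chosen for illustration; it is not superdiversification itself. The names `sha256`, `ch_variant` and `maj_variant` are hypothetical.

```python
import hashlib
import math

MASK = 0xFFFFFFFF

def _primes(count):
    found, candidate = [], 2
    while len(found) < count:
        if all(candidate % p for p in found):
            found.append(candidate)
        candidate += 1
    return found

def _icbrt(n):
    """Integer cube root by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

# Constants: fractional bits of square roots (H0) and cube roots (K) of the first primes.
H0 = [math.isqrt(p << 64) & MASK for p in _primes(8)]
K = [_icbrt(p << 96) & MASK for p in _primes(64)]

def _rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def ch_reference(x, y, z):
    return (x & y) ^ (~x & z)

def maj_reference(x, y, z):
    return (x & y) ^ (x & z) ^ (y & z)

def ch_variant(x, y, z):
    return z ^ (x & (y ^ z))      # same truth table as ch_reference

def maj_variant(x, y, z):
    return (x & y) | (z & (x | y))  # same truth table as maj_reference

def sha256(message, ch=ch_reference, maj=maj_reference):
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)
    padded += (len(message) * 8).to_bytes(8, "big")
    h = list(H0)
    for off in range(0, len(padded), 64):
        w = [int.from_bytes(padded[off + 4 * i: off + 4 * i + 4], "big") for i in range(16)]
        for i in range(16, 64):
            s0 = _rotr(w[i - 15], 7) ^ _rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
            s1 = _rotr(w[i - 2], 17) ^ _rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
            w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
        a, b, c, d, e, f, g, hh = h
        for i in range(64):
            big_s1 = _rotr(e, 6) ^ _rotr(e, 11) ^ _rotr(e, 25)
            t1 = (hh + big_s1 + ch(e, f, g) + K[i] + w[i]) & MASK
            big_s0 = _rotr(a, 2) ^ _rotr(a, 13) ^ _rotr(a, 22)
            t2 = (big_s0 + maj(a, b, c)) & MASK
            a, b, c, d, e, f, g, hh = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
        h = [(x + y) & MASK for x, y in zip(h, (a, b, c, d, e, f, g, hh))]
    return b"".join(x.to_bytes(4, "big") for x in h)

message = b"software diversity"
variant = sha256(message, ch=ch_variant, maj=maj_variant)
print(variant == hashlib.sha256(message).digest())  # True
```

Checking every variant against a trusted implementation, as done here with `hashlib`, is one simple way to make sure that diversification preserves the function's behavior.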

Automatic diversification of Kafka

Automatic software diversity consists in generating multiple variants of an application, which provide the same functionality, with diverse implementations.
The goal is to minimize the risks of having a single point of failure.
In this project, we aim at automatically synthesizing diverse variants of applications that stream data with Kafka [1]. Diversification will target Kafka itself, e.g., by building the application with different versions of Kafka. We will also leverage the natural emergence of the Kafka-compatible streaming library Redpanda [2].

  • [1] Kafka
  • [2] redpanda
  • [3] The multiple facets of software diversity: Recent developments in year 2000 and beyond

Off the beaten track

Code by singing for eso-lang

The progress of voice recognition and speech-to-text technology is fabulous. It opens the way towards coding by voice, a very promising advance to open the world of programming to a wider population [1].
In this thesis, we will explore the possibilities of writing code by singing. This master thesis, at the intersection of software technology, signal processing and rickrolling, will be disseminated as part of a growing eso-lang [2].

Github repositories with literary references

GitHub repositories are rich sources of code, documentation and discussions. They also contain amazing resources such as images, sound snippets, texts or references. A recent study has analyzed the presence of links to academic papers in GitHub repositories [1]. This study reveals the critical importance of linking code, data and publications to improve replication in computational science. In this work we wish to explore literary references in GitHub. For example, references to Bob Dylan cited in C code or novel quotes in comments, perl -le'$_=`perldoc -T perlfaq4`,s/^.*N;(.*?)E.*$/$1/s,print'.

The study seeks to unveil the deep connection of Github with culture and society and to analyze the role of literature on software development.

Anatomy of Outlook mail

Every day we use extraordinary software objects. Examples of such objects include the Android mobile systems that run on billions of devices, the domain name system that runs the web, or the Outlook email client that lets millions of workers communicate efficiently. These objects are extraordinary in several respects: they are large, they are composed of hundreds of diverse software parts, they evolve fast, and they exist in many versions that are tailored to various needs. The massive presence of such objects, as well as the very large dimensions that characterize them, are intriguing for software developers and for users. One approach to unveil the extraordinary nature of such an object consists in breaking it down into all of its components, which amounts to an anatomical analysis of the object [1,2].

In this work, we aim at building a fine-grained anatomy of an extraordinary, extremely popular software object: the Outlook email client.

Easter egg VM flag

Easter eggs, sometimes called the final frontier of software development [10]. (Except that of course you can’t have a final frontier, because there’d be nothing for it to be a frontier to, but as frontiers go, it’s pretty penultimate . . .) [269696]. And against the wash of continuous integration a commit hangs, bloated and poetic, one single, cool contribution, gleaming like the madness of gods. Nearly unreal. Reality is not digital, an on-off state, but analog. Easter eggs are for lovers and for the mind. Not enterprise, nor a resurrection, they cherish enchantment and freedom. In the quest for technology and Mastery, you will add an extra mile to the frontier with a new Easter flag for an extraordinary virtual machine [42].

[42] java -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal -version
[10] Curated list of all the easter eggs and hidden jokes in Python.
[269696] Moving Pictures. T. Pratchett. 1990, on Monday afternoon, just before tea.

Web stalker: deconstructing modern browser technology (remix)

In 1998, Simon Pope, Colin Green and Matthew Fuller designed the Web Stalker, an alternative web browser that displays the structure of web pages instead of their content [1]. The work was driven by a strong desire to understand what happens beyond the screen and to let web users experience this understanding. Twenty years later, the adoption of the web has radiated massively into all aspects of our lives and the complexity of web browsers has exploded.
This project is about rethinking a web stalker in the era of modern web browsers, going from the design of a solution that leverages the architecture of these browsers [2] to the implementation of an artistic representation of web page content based on Electron [3].

Paint Splatters & Perl Programs (remix)

In 2019, Colin McMillen and Tim Toady ran an experiment to answer one question: is it possible to smear paint on the wall without creating valid Perl? This is an essential question at the art / computing frontier.
In this project, we will reproduce McMillen’s experiment [1], starting with the curated dataset provided by the authors [2]. We will then elaborate on the findings with original splatters and an exploration of Perl’s diverse ecosystem [3].