DUETS: A Dataset of Reproducible Pairs of Java Library-Clients

DUETS is a dataset of software libraries and their clients. It can be used to gain insights into API usage and usage inputs, or to make novel observations about the test suites of clients and libraries. DUETS is meant to support both static and dynamic analysis: the libraries and the clients compile correctly, they are executable, and their test suites pass. The dataset is composed of open-source projects that have more than five stars on GitHub. The final dataset contains 395 libraries and 2,874 clients. Additionally, we provide the raw data that we used to create this dataset, such as 34,560 pom.xml files and the complete file list from 34,560 projects. This dataset can be used to study how libraries are used by their clients, or as a list of software projects that build successfully. The clients' test suites can also serve as an additional verification step for code transformation techniques that modify the libraries.

Maven Central dependency graph

The Maven dependency graph is an open dataset of Maven Central artifacts, their dependencies, and other relationships between them. Its main intent is to domesticate the wild within and around the Maven Central ecosystem in particular, and JVM-based libraries at large, making it more harnessable for both academics and industry. It is intended to answer high-level research questions concerning artifact releases, evolution, and usage trends over time. It can also assist researchers in selecting relevant datasets, among the mass of existing software artifacts, for assessing particular empirical software engineering challenges. The complexity of these questions can range from simple pattern matching to advanced big data analysis and machine learning techniques.

The paper accompanying this dataset has been accepted for publication in the proceedings of the International Conference on Mining Software Repositories 2019 and has received the MSR 2019 Data Showcase Award. This paper is available for download on arXiv.

  • Location: Zenodo
  • Content: Metadata for 2.4M Maven artifacts.

New use cases for Sosiefication

In the context of the DIVERSIFY project, we investigate the automatic generation of diverse program variants that are all functionally similar. This work is based on “Tailored Source Code Transformations to Synthesize Computationally Diverse Program Variants”.
In our most recent work, “Automatic Software Diversity in the Light of Test Suites”, we have performed experiments with the source code and test suites of 6 popular Java libraries.

  • Java
  • Location: GitHub
  • Content: 6 large Java programs with high quality JUnit test suites.

Software monoculture in WordPress and JavaScript

In the context of the DIVERSIFY project, we have collected a large quantity of data about WordPress plugins and JavaScript libraries and shown that, despite the huge diversity of available software, websites currently use only a small set of plugins or libraries, creating a monoculture in web applications.
Check data

Java sources and test suites for Sosiefication

  • Java
  • Location:
  • Content: 9 large Java programs with high quality JUnit test suites.
  • Source: We have chosen projects that are widely used (such as JUnit) and that target good quality through extensive testing (such as the Apache Commons libraries).

Download data

We have used this data set to synthesize sosie programs. We have synthesized thousands of sosies, and 100 sosies of JUnit are available here. The complete results and methodology are published in our paper “Tailored Source Code Transformations to Synthesize Computationally Diverse Program Variants”, which was presented at ISSTA’14.

Java sources with high usage diversity

  • Java
  • Location:
  • Content: 3,418 JAR files, which include 382,774 different types (classes or interfaces).
  • Source: We have collected all JAR files present on a machine used for performing software mining experiments for 7 years.

Download data

We have used this data set for our analysis of API usage diversity. The results are published in our paper “Empirical Evidence of Large-Scale Diversity in API Usage of Object-Oriented Software” (Diego Mendez, Benoit Baudry, Martin Monperrus), which was presented at the SCAM’13 conference.

Metamodels and well-formedness rules

  • Ecore, OCL
  • Location: REMODD model repository
  • Content:
    • 14 metamodels. Five of these metamodels include between 3 and 13 packages, each of which can be considered as an independent metamodel.
    • 1262 well-formedness rules
  • Source: We have collected this data set from the OMG, our industrial partners, and an open call to the community (through the planetmde mailing list). The original data was in various formats, but we have made it homogeneous (only Ecore and OCL) in this data set.

Download data

We have used this data set to analyze the interactions between two formalisms (Ecore and OCL) for metamodeling. The results are available in a technical report: “Ten years of Meta-Object Facility: an Analysis of Metamodeling Practices” (Juan Cadavid, Benoit Combemale, Benoit Baudry).