License validation

Scope

This document describes a suitable system for handling license requirements and compliance validation.

Terminology and concepts

Agent

Software component responsible for the extraction of licensing information from source packages

Copyright

Legal right created by the law of a country that grants the creator of an original work exclusive rights for its use and distribution

License

Legal instrument (usually by way of contract law, with or without printed material) governing the use or redistribution of software

Ninka

Standalone license scanner that can also be used as a FOSSology agent

Nomos

FOSSology agent license scanner

OBS

Open Build Service

OSS

Open Source Software

Tools under review

Generic license check tools

The tools listed below allow users to extract licensing information by scanning source code. They can operate at different levels of granularity, from a single source code file, to source tar packages, to ISO images containing source packages.

These tools are not tied to any specific distribution and are focused on Open Source licenses.

FOSSology

FOSSology is a framework, a toolbox and web application for examining software packages in a multi-user environment.

From the web application, or through the web API from a CLI, a user can upload individual files or entire software packages to be scanned. FOSSology then unpacks the uploaded data if necessary and runs a chosen set of agents on every extracted file.

The FOSSology framework currently focuses on licensing checks, but it could be combined with agents that perform other kinds of tasks, such as static code analysis.

In particular, its current toolkit can run licensing, copyright and export control scans from the command line.

The web application adds a web UI and a database to provide a compliance workflow. In one click it can generate an SPDX file, or a ReadMe with the copyright notices from shipped software.

FOSSology also deduplicates the entries to be analyzed, which means that it can scan an entire distribution and, when a new version is submitted, rescan only the files that actually changed.
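The deduplication idea can be sketched as follows. This is only an illustration and not FOSSology's actual implementation: it assumes that files are identified by a SHA-256 digest of their contents, and the helper names (`content_key`, `files_to_scan`) and the sample uploads are hypothetical.

```python
import hashlib


def content_key(data: bytes) -> str:
    """Identify a file by the SHA-256 digest of its contents (an assumption)."""
    return hashlib.sha256(data).hexdigest()


def files_to_scan(upload, known_keys):
    """Return the paths in `upload` whose contents were never scanned before."""
    seen = set(known_keys)
    pending = []
    for path, data in sorted(upload.items()):
        key = content_key(data)
        if key not in seen:
            pending.append(path)
            seen.add(key)  # identical files inside one upload are scanned once
    return pending


# Hypothetical uploads: two versions of the same package.
release_1 = {"COPYING": b"license text", "main.c": b"int main(void) { return 0; }"}
release_2 = {"COPYING": b"license text", "main.c": b"int main(void) { return 1; }"}

known = set()
first_pass = files_to_scan(release_1, known)
known.update(content_key(data) for data in release_1.values())
second_pass = files_to_scan(release_2, known)

print(first_pass)   # both files are new
print(second_pass)  # only the changed file remains
```

Keying on content rather than on file name means that a renamed but otherwise unchanged file would not be rescanned either.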

FOSSology has many different interesting features:

  • Regular expression scanning for licenses with Nomos
  • Text-similarity matching with Monk
  • Copyrights search
  • Export Control Codes (ECC)
  • Bucket processing
  • License reviewing
  • License text management
  • Mark a license as main license of a software package
  • Bulk recognition. Text phrase scan to identify files with similar license contents that are recurring across multiple files
  • Aggregated file view
  • Reuse of license reviews
  • Export information in different formats:
    • Readme files for the distribution containing all identified license texts and copyright information
    • List of files in hierarchical structure with found licenses identified by the short name identifier
    • SPDX 2.0 export using the tag-value and the RDF-(XML)-format
    • Debian-copyright (a.k.a. DEP5) files
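As an illustration of the first feature in the list, regular expression scanning can be sketched in a few lines. The patterns and names below are hypothetical and far simpler than the rule sets shipped with Nomos; the sketch only shows the general technique.

```python
import re

# Hypothetical patterns for illustration; real scanners ship much larger sets.
LICENSE_PATTERNS = {
    "GPL-2.0-or-later": re.compile(
        r"GNU General Public License.*?version 2 of the License, or\s+"
        r"\(at your option\) any later version",
        re.IGNORECASE | re.DOTALL,
    ),
    "MIT": re.compile(r"Permission is hereby granted, free of charge", re.IGNORECASE),
    "Apache-2.0": re.compile(
        r"Licensed under the Apache License,\s+Version 2\.0", re.IGNORECASE
    ),
}


def scan_text(text):
    """Return the sorted names of all patterns that match the given text."""
    return sorted(
        name for name, pattern in LICENSE_PATTERNS.items() if pattern.search(text)
    )


HEADER = """This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version."""

gpl_result = scan_text(HEADER)
print(gpl_result)
```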

Backend tools and scanners are written in C/C++ and the frontend web application is implemented with PHP.

Ninka

Ninka is a lightweight license identification tool for source code. It is sentence-based, and provides a simple way to identify open source licenses in a source code file. It is capable of identifying several dozen different licenses (and their variations).
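A sentence-based approach can be sketched as follows: the license statement is split into sentences, each sentence is mapped to a token, and the sequence of tokens is looked up in a rule table. The rules, token names and helper functions here are hypothetical and much smaller than Ninka's own rule set; the sketch reports UNKNOWN whenever any sentence is not recognised, rather than guessing.

```python
import re

# Hypothetical sentence rules: each known sentence maps to a token.
SENTENCE_RULES = [
    (re.compile(r"permission is hereby granted, free of charge", re.I), "MITgrant"),
    (re.compile(r"the above copyright notice and this permission notice", re.I), "MITnotice"),
    (re.compile(r'the software is provided "as is"', re.I), "AsIsWarranty"),
]

# Hypothetical license rules: a sequence of tokens maps to a license name.
LICENSE_RULES = {("MITgrant", "MITnotice", "AsIsWarranty"): "MIT"}


def split_sentences(text):
    """Flatten whitespace, then split after each full stop."""
    flat = " ".join(text.split())
    return [s for s in re.split(r"(?<=\.)\s+", flat) if s]


def identify(text):
    """Return a license name only if every sentence is recognised."""
    tokens = []
    for sentence in split_sentences(text):
        for pattern, token in SENTENCE_RULES:
            if pattern.search(sentence):
                tokens.append(token)
                break
        else:
            tokens.append("UNKNOWN")
    return LICENSE_RULES.get(tuple(tokens), "UNKNOWN")


MIT_TEXT = (
    "Permission is hereby granted, free of charge, to any person obtaining a copy "
    "of this software, to deal in the Software without restriction. "
    "The above copyright notice and this permission notice shall be included in "
    "all copies. "
    'THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND.'
)

print(identify(MIT_TEXT))
print(identify(MIT_TEXT + " You must also send the author a postcard."))
```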

Ninka has been designed with the following design goals:

  • To be lightweight
  • To be fast
  • To avoid making errors

FOSSology has recently added support for Ninka as an agent. Ninka is mainly written in Perl.

scancode-toolkit

scancode-toolkit scans code and detects licenses, copyrights, package manifests and dependencies. It is used to discover and inventory Open Source and third-party packages used in projects and can generate SPDX documents.

Given a codebase in a directory, scancode will:

  • Collect an inventory of the code files and classify the code using file types
  • Extract files from any archive using a general purpose extractor
  • Extract texts from binary files if needed
  • Use an extensible rules engine to detect open source license text and notices
  • Use a specialized parser to capture copyright statements
  • Identify packaged code and collect metadata from packages
  • Report the results in your choice of JSON or HTML for integration with other tools
  • Display the results in a local HTML browser application to assist your analysis

ScanCode is written in Python and also uses other open source packages.

licensed

licensed has been recently released by GitHub to check the licenses of the dependencies of a project.

It relies on modern language package managers (bower, bundler, cabal, go, npm, stack) to pull the dependency chain of a specific project.

Licenses can be configured to be either accepted or rejected, which eases the developer's task of identifying problematic dependencies when importing a new third-party library.
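The accept/reject idea can be sketched as a simple policy check. This does not reproduce the configuration format or behaviour of licensed; the policy sets, the dependency names and the `review_dependencies` helper are hypothetical.

```python
# Hypothetical policy; a real project would maintain these lists itself.
ACCEPTED = {"MIT", "Apache-2.0", "BSD-3-Clause"}
REJECTED = {"AGPL-3.0-only"}


def review_dependencies(dependencies):
    """Sort dependencies into accepted, rejected and needs_review buckets."""
    report = {"accepted": [], "rejected": [], "needs_review": []}
    for name, license_id in sorted(dependencies.items()):
        if license_id in REJECTED:
            report["rejected"].append(name)
        elif license_id in ACCEPTED:
            report["accepted"].append(name)
        else:
            # Neither accepted nor rejected: a human has to decide.
            report["needs_review"].append(name)
    return report


# Hypothetical dependency chain, as a package manager might report it.
deps = {"left-pad": "MIT", "netlib": "AGPL-3.0-only", "oddball": "Custom-1.0"}
report = review_dependencies(deps)
print(report)
```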

Debian centric license check tools

The tools below focus on Debian-derived environments, and work with the DEP5 debian/copyright file format and/or Debian packages.

licensecheck

licensecheck scans source code and reports the copyright holders and known licenses it finds. It detects licenses with a medium-sized dataset (about 200 regexes) of regex patterns and key phrases (parts), which it reassembles into detected licenses based on rules. In that sense it is somewhat similar to the combined approaches of FOSSology/Nomos and Ninka. It also detects copyright statements. It outputs results in plain text (with a customizable delimiter) or in the Debian copyright file format. It is written in Perl.
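Detection of copyright statements can be illustrated with a single regular expression. This is a minimal sketch, not licensecheck's own pattern set: the expression, the `find_copyrights` helper and the sample header are hypothetical, and real headers vary far more than this handles.

```python
import re

# Hypothetical pattern: "Copyright", optional (C) or ©, years, then the holder.
COPYRIGHT_RE = re.compile(
    r"copyright\s+(?:\(c\)|©)?\s*"
    r"(?P<years>\d{4}(?:\s*[-,]\s*\d{4})*)\s+"
    r"(?P<holder>[^\n]+)",
    re.IGNORECASE,
)


def find_copyrights(text):
    """Return (years, holder) pairs for every copyright line found."""
    return [
        (match.group("years"), match.group("holder").strip())
        for match in COPYRIGHT_RE.finditer(text)
    ]


SAMPLE = """\
/*
 * Copyright (C) 2015-2017 Example Corp.
 * Copyright 2012, 2014 Jane Doe <jane@example.org>
 */
"""

found = find_copyrights(SAMPLE)
for years, holder in found:
    print(years, "|", holder)
```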

A draft debian/copyright file can be generated automatically with:

licensecheck --copyright -r `find * -type f` | \
  /usr/lib/cdbs/licensecheck2dep5 > debian/copyright.auto

debmake

debmake is a helper program for generating Debian packages. It has options for checking copyright and license information (-c) and for comparing debian/copyright against the current sources and exiting (-k). It is written in Python.

A debian/copyright file can be generated automatically with:

debmake -cc > debian/copyright

Compare debian/copyright against the current sources:

debmake -k

It focuses on license types and file matching, and is able to detect ineffective blocks in the copyright file.

It is buggy due to faulty Unicode handling.

license-reconcile

license-reconcile provides an alternative way to compare debian/copyright against the current source tree. It reports missing copyright holders and years, but during testing it was confused by inconsistent license names.

license-reconcile attempts to match license and copyright information in a directory with the information available in debian/copyright. It gets most of its data from licensecheck, so it should produce something worth looking at out of the box. Its main value comes once it has been configured to succeed for a package in a known good state: any failure on a subsequent upstream update then points to licensing changes that must be reviewed and acknowledged.
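The reconciliation idea can be sketched as a comparison between the license declared in debian/copyright and the license detected per file. This is not how license-reconcile is implemented: the parser below handles only a small subset of the DEP5 format, `fnmatchcase` only approximates DEP5 wildcard semantics, and the helper names, sample file and detected licenses are hypothetical. The sketch follows the DEP5 convention that the last matching Files paragraph applies.

```python
from fnmatch import fnmatchcase


def parse_dep5(text):
    """Return (patterns, license) pairs from the Files paragraphs, in order."""
    stanzas = []
    for paragraph in text.strip().split("\n\n"):
        fields, key = {}, None
        for line in paragraph.splitlines():
            if line[:1] in (" ", "\t"):
                if key is not None:  # continuation of the previous field
                    fields[key] += "\n" + line.strip()
            elif ":" in line:
                key, _, value = line.partition(":")
                fields[key] = value.strip()
        if "Files" in fields and "License" in fields:
            short_name = fields["License"].split("\n", 1)[0]
            stanzas.append((fields["Files"].split(), short_name))
    return stanzas


def declared_license(path, stanzas):
    """Return the license of the last Files paragraph that matches the path."""
    declared = None
    for patterns, license_name in stanzas:
        if any(fnmatchcase(path, pattern) for pattern in patterns):
            declared = license_name
    return declared


def reconcile(stanzas, detected):
    """Return {path: (declared, detected)} for every file that disagrees."""
    mismatches = {}
    for path, found in sorted(detected.items()):
        declared = declared_license(path, stanzas)
        if declared != found:
            mismatches[path] = (declared, found)
    return mismatches


COPYRIGHT_FILE = """\
Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/

Files: *
Copyright: 2016 Example Corp
License: GPL-2+

Files: lib/vendor/*
Copyright: 2014 Jane Doe
License: Expat
"""

# Hypothetical scanner output for the current source tree.
detected = {"src/main.c": "GPL-2+", "lib/vendor/json.c": "Expat", "lib/util.c": "LGPL-2.1+"}
mismatches = reconcile(parse_dep5(COPYRIGHT_FILE), detected)
print(mismatches)
```

Because the comparison is an exact string match, a license spelled differently in the two sources would be reported as a mismatch.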

cme

cme is based on a configuration parsing library.

cme update dpkg-copyright

This creates or updates debian/copyright. The cme tool seems to handle UTF-8 names better than debmake. It is written in Perl and uses licensecheck.

elbe-parselicense

elbe-parselicense generates a file containing the licenses of the packages included in a project.

dlt

dlt supports parsing and creating Debian machine-readable copyright files. It is written in Python.

Most of the tools discussed in the previous sections are useful in one way or another, and some build on top of others. For the Apertis use case, it is advisable to use a tool that already provides a framework to deal with licenses and copyrights. The other tools can be hooked into different processes for particular use cases if needed, or used to double-check the output from other tools if desired.

A good starting point is FOSSology: it already provides a database that keeps track of licenses and copyrights, it supports the SPDX and DEP5 output formats, and its architecture is easily extendable via plugins. This proposal therefore recommends starting with FOSSology. Once the initial setup is done and the workflow is defined, it can be fine-tuned by bringing in the other tools or by extending FOSSology with such support.

Integration with current tools

In the current Apertis CI infrastructure, there are several stages:

  • Phabricator (code review) - source code review system
  • Jenkins (buildpackage CI) - CI build per source package code changes
  • Open Build Service (distro) - contains all the distribution packages
  • Jenkins (images) - builds images from distributed package repository pools
  • LAVA (testing) - manages automated tests for different sets of images
  • Phabricator (bugtracker) - keeps track of image defects

As an initial step, it looks plausible to hook FOSSology in after a new source package is added or updated in Open Build Service. That way the FOSSology database should contain all the needed data regarding licenses and copyrights, and it can be queried to extract information when needed.

Approach

The following proposal outlines the way FOSSology is meant to interact with other parts of the system.

Inputs

  • The FOSSology server will be fed with source code tarballs from the repositories, starting with the packages that make up the target runtime, which are added to a FOSSology bucket.
  • A list of the software packages that make up the target image runtime will be provided to FOSSology.

Deliverable

  • An SPDX and/or DEP5 license report of the software packages found in the target runtime image.
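To make the shape of a DEP5 deliverable concrete, the sketch below renders per-file findings as a machine-readable debian/copyright skeleton. It is only an illustration under stated assumptions: the `dep5_report` helper, the package name and the findings are hypothetical, the input is assumed to be already exported from the scanner, and the stand-alone License paragraphs with full license texts that a complete file needs are omitted.

```python
from collections import defaultdict

DEP5_FORMAT = "https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/"


def dep5_report(package, findings):
    """Render {path: (license, copyright)} findings as DEP5 paragraphs."""
    groups = defaultdict(list)
    for path, (license_name, copyright_line) in findings.items():
        # Files sharing the same license and copyright go in one paragraph.
        groups[(license_name, copyright_line)].append(path)

    paragraphs = [f"Format: {DEP5_FORMAT}\nUpstream-Name: {package}"]
    for (license_name, copyright_line), paths in sorted(groups.items()):
        files = "\n ".join(sorted(paths))  # continuation lines start with a space
        paragraphs.append(
            f"Files: {files}\nCopyright: {copyright_line}\nLicense: {license_name}"
        )
    return "\n\n".join(paragraphs) + "\n"


# Hypothetical findings for one package of the target runtime.
findings = {
    "src/a.c": ("GPL-2+", "2016 Example Corp"),
    "src/b.c": ("GPL-2+", "2016 Example Corp"),
    "lib/json.c": ("Expat", "2014 Jane Doe"),
}
report_text = dep5_report("demo", findings)
print(report_text)
```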

Every release should have a license report.

WIP:

  • Setup
  • Configuration
  • Clearing licenses
  • Rules setup
  • Day to day operation
  • Notifications
  • Generating a report

TBD: FOSSology manual workflow for clearing licenses
