Background

The last phase in the current CI workflow is running the LAVA automated tests, but there is no mechanism in place to properly process and report the results of these tests, which leaves the CI loop incomplete.

Current Issues

The biggest issues are:

  • Test results need to be checked manually in the LAVA logs and dashboard.

  • Bugs need to be reported manually for test issues.

  • Weekly test reports need to be created manually. This point may only be partially addressed by this concept document, since a proper data storage for test cases and test results has not been defined.

  • There is no mechanism in place to conveniently notify about test issues, so critical issues can easily be overlooked.

Proposal

This document proposes a design around the available infrastructure to implement a solution to close the CI loop.

The document only covers automated tests; manual tests are left for a later proposal with a more complete solution.

Benefits of closing the CI loop

Closing the loop will save time and resources by automating the manual tasks of checking automated test results and reporting the issues they reveal. It will also provide the infrastructure foundation for further improvements in tracking the overall project health.

From a design perspective, it will also help to keep a more complete workflow in place for the whole infrastructure.

Some of the most important benefits:

  • Checking automated test results will need minimal or no manual intervention.
  • Automated test failures will be reported automatically and on time.
  • It will provide a more consistent and accurate way to track issues found by automated tests.
  • It will help to keep test reports up to date.
  • It will provide the infrastructure components needed to implement further improvements in tracking the overall project health.

Though the project as a whole will benefit from the above points, some benefits are more relevant to specific project roles and areas. The following subsections list these benefits for each role.

Developers and Testers Benefits

  • It will save time for developers and testers, since they won't need to check automated test logs and test results manually in order to report issues.
  • Developers will be able to notice and work on critical issues much faster, since failures will become visible sooner.

Managers Benefits

  • Since automated test issues will be reported on time and more consistently, managers will be able to make more accurate decisions during planning.

Product Teams Benefits

  • The whole CI workflow for automated tests will be properly implemented, offering a more complete solution to other teams and projects.
  • Closing the CI loop offers a more coherent infrastructure design that other product teams can adapt to their own needs.
  • Product teams will have a better view of the bugs opened in a given period, and thus a better idea of the overall project health.

Overview of steps to close the CI loop

This is an overview of the phases required to close the current CI loop:

  • Test results should be fetched from LAVA.
  • Test results should be processed and optionally saved somewhere.
  • Test results should be analyzed.
  • Tasks for test issues should be created using the analyzed test results.

Current Infrastructure

This section explores the different services available in our infrastructure that are proposed to implement the remaining phases needed to close the CI loop.

LAVA User Notifications

As of LAVA V2, there is a new feature called Notification Callback, which allows sending a GET or POST request to a specified URL to trigger some action remotely. When using a POST request, test job information and results can be attached and sent along.

This can be used to send the test results from LAVA back to Jenkins for further processing in new pipeline phases.

Jenkins Webhook Plugin

This plugin provides an easy way to block a build pipeline in Jenkins until an external system posts to a webhook.

This can be used to wait for the automated test results sent by LAVA from a new Jenkins job responsible for triggering the automated tests.

Phabricator API

Conduit is the developer API for Phabricator which can be used to implement the management of tasks.

This API can be used (either with tools or language bindings) to manage Phabricator tasks from a Jenkins phase in the main pipeline.
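
As an illustration, a minimal Python sketch of such a query using the Conduit maniphest.search method could look as follows; the API token is a placeholder and the exact constraint names and values can be verified in the Conduit console of the Phabricator instance:

    # Sketch: query the Conduit API for open tasks tagged "test-failure".
    import requests

    CONDUIT_URL = "https://phabricator.apertis.org/api/maniphest.search"
    API_TOKEN = "api-xxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # placeholder Conduit token

    response = requests.post(CONDUIT_URL, data={
        "api.token": API_TOKEN,
        # Conduit accepts PHP-style nested form fields for complex parameters.
        "constraints[statuses][0]": "open",
        "constraints[projects][0]": "test-failure",
    })
    response.raise_for_status()

    for task in response.json()["result"]["data"]:
        print(task["id"], task["fields"]["name"])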

Mattermost

Mattermost is the chat system used by the Apertis project.

Jenkins already offers a plugin to send messages to Mattermost.

This can be used to send notification messages to the chat channels, for example, to notify the team once a critical test starts failing, or when a bug has been updated.
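
As an illustration of the kind of notification this enables, the sketch below posts a message directly to a Mattermost incoming webhook from Python; the webhook URL and channel name are placeholders, and in practice the Jenkins plugin mentioned above can be used instead:

    # Sketch: send a notification to a Mattermost channel through an
    # incoming webhook (URL and channel are placeholders).
    import requests

    WEBHOOK_URL = "https://mattermost.example.com/hooks/xxxxxxxxxxxx"

    def notify(message, channel="ci-results"):
        payload = {"channel": channel, "text": message}
        requests.post(WEBHOOK_URL, json=payload).raise_for_status()

    notify(":warning: sanity-check started failing on the latest daily images")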

CI Workflow Overview

The main workflow would basically consist of combining the above mentioned technologies to implement the different phases of the main CI pipeline.

A general overview of the steps involved would be:

  • Jenkins builds images and triggers LAVA jobs.
  • The webhook pipeline plugin is used to wait in Jenkins for the LAVA test results.
  • LAVA executes the automated test jobs and the results are saved in its database.
  • LAVA triggers a user notification callback, attaching test job information and results, to send to the Jenkins webhook.
  • Test results are received by Jenkins through the webhook.
  • Test information is sent to a new pipeline to process and analyze the test results.
  • Once test results are processed and analyzed, they are sent to a new pipeline to manage Phabricator tasks.
  • Optionally, a new Jenkins phase could send result notifications to Mattermost or via email.

This complete loop will be executed every time new images are built.

Fetching Tests Results

The initial and most important phase in closing the loop is fetching and processing the automated test results from LAVA.

The solution proposed in this document is to use the webhook plugin to fetch the LAVA test results from Jenkins once the automated test job is finished.

Currently, LAVA tests are submitted in the last stage of the Jenkins pipeline job that creates and publishes the images.

Automated tests are organized in groups, which are submitted all at once using the lqa tool for each image type once the images are published.

A webhook should be registered for each test job rather than for a group of tests, so a change in the way LAVA jobs are submitted is required.

Jenkins and LAVA Interaction

The proposed solution is to separate the LAVA job submission stage from the main Jenkins pipeline job building images, and instead have a single Jenkins job that will take care of triggering the automated tests in LAVA once the images are published.

The only required fields for the stage submitting the LAVA test jobs are the image_name, profile_name, and version of the image. A single Jenkins job could receive these values as arguments and trigger the automated tests for each of the respective images.

The way LAVA jobs are submitted from Jenkins will also require some changes. The lqa tool currently submits several groups of test jobs at once, but since each test job requires a unique webhook, they will need to be submitted independently.

One simple solution is to have lqa process the job templates first and then submit each processed job file with a unique webhook.
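
A minimal Python sketch of this approach is shown below. It assumes LAVA's XML-RPC scheduler.submit_job method and the notify callback fields described in the LAVA V2 user notifications documentation; the credentials, file names and webhook URL are placeholders generated on the Jenkins side:

    # Sketch: process each LAVA job template and submit it with its own
    # unique webhook callback URL attached to the "notify" section.
    import xmlrpc.client
    import yaml

    LAVA_API = "https://user:lava-token@lava.collabora.co.uk/RPC2"  # placeholder credentials
    server = xmlrpc.client.ServerProxy(LAVA_API)

    def submit_with_callback(job_file, callback_url):
        with open(job_file) as f:
            job = yaml.safe_load(f)
        # Field names as described in the LAVA V2 user notifications docs.
        job["notify"] = {
            "criteria": {"status": "finished"},
            "callback": {
                "url": callback_url,
                "method": "POST",
                "dataset": "results",
                "content-type": "json",
            },
        }
        return server.scheduler.submit_job(yaml.safe_dump(job))

    job_id = submit_with_callback("sanity-check-amd64.yaml",
                                  "https://jenkins.example.com/webhook/abc123")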

Once all test jobs are submitted for a specific image type, the Jenkins executor will wait for all of their webhooks. This will block the executor, but since the webhook returns immediately for those jobs that have already posted their results to the webhook callback, in practice the executor will only block until the last completed test job sends its results back to Jenkins.

After all results are received in Jenkins, they can be processed by the remaining stages required for task management.

Jenkins Jobs

Since images are built from a single Jenkins job, the most sensible approach for the final implementation is to have a new Jenkins job that receives the information for all image types and triggers tests for all of them, then a different job for processing test results, and possibly another one handling the task management phases.

Tasks Management

One of the most important phases in closing the loop is reporting test issues in Phabricator.

Test issues will be reported automatically in Phabricator as one task per test case instead of one task per issue. This has an important consequence, explained in the Considerations section.

This section gives an overview of the behaviour of this phase.

Workflow Overview

Management of Phabricator tasks can be as follows (a sketch of this logic is given after the list):

  1. Query Phabricator to find all open tasks with the tag test-failure.

  2. Filter the list of received tasks to make sure only the exact tasks are processed. For this, scanning for further specific fields in the task can be helpful, for example, keeping only tasks with a specific task name format.

  3. Fetch the analyzed test results.

  4. For each test, based on its results and on the tasks list, do the following:

    a) Task exists: Add a comment to the task.

    b) Task does not exist:

    • If the test has status failed: create a new task.
    • If the test has status passed: do nothing.
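
The following Python sketch summarises this logic. The phabricator and results objects stand for hypothetical helpers wrapping the Conduit API and the analyzed test results; the tag and task name format follow the Phabricator Conventions section below:

    # Sketch of the task management logic; "phabricator" and "results" are
    # hypothetical helpers wrapping the Conduit API and the analyzed results.
    def manage_tasks(phabricator, results):
        # 1-2. Open tasks tagged "test-failure", filtered by task name format.
        open_tasks = {
            task.test_case: task
            for task in phabricator.search_open_tasks(tag="test-failure")
            if task.matches_name_format()
        }

        # 3-4. Walk the analyzed test results and update or create tasks.
        for test in results:
            task = open_tasks.get(test.case_name)
            if task is not None:
                # a) Task exists: add a comment with the failure details and
                #    a link to the LAVA job logs.
                phabricator.add_comment(task, test.summary_with_log_link())
            elif test.status == "failed":
                # b) No task and the test failed: create a new task using the
                #    name format from the Phabricator Conventions section.
                phabricator.create_task(
                    title="{} failed: ".format(test.case_name),
                    tag="test-failure",
                    description=test.summary_with_log_link(),
                )
            # b) No task and the test passed: nothing to do.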

Considerations

  • The comment added to the task will contain general information about the failure, with a link to the LAVA job logs.

  • Tasks won't be reported per platform but per test case. Once a task for a test case failure is reported, all platform failures for that test case should be added as comments to that single task.

  • Closing and verifying tasks will still require manual intervention. This helps to avoid the following corner cases:

    • Flaky tests that would otherwise end up in a series of new tasks that get auto-closed.
    • Tests failing on one image while succeeding on a different image.
  • If a test starts failing again after its task has been closed, a new task will be created automatically for it, and manual verification is required to check whether it is the same issue as previously reported, in which case it is recommended to add a reference to the old task.

  • If, after fixing the issue for a reported task, a new issue arises for the same test case, the same old task will be updated with this new issue. This is an effect of reporting tasks per test case instead of per issue. In such a case, manual verification can be used to confirm whether or not it is the same issue, and a new subtask can be created manually by the developer if deemed necessary.

Phabricator Conventions

Automating the management of Phabricator tasks will require establishing certain conventions in Phabricator. This will require minimal manual intervention.

First of all, a specific user should be created in Phabricator to manage these tasks automatically.

This user could be named apertis-qa or apertis-tests, and its only purpose will be to manage tasks automatically at this stage of the loop.

A special tag and a specific task name format will also be used for tasks reported by this special user:

  • The tag test-failure is the special tag for automated test failures.
  • The task name will have the format: "{testcasename} failed: ".
  • A {testcasename} tag can also be used if it is available for the test.

Design and Implementation

This section gives a brief overview of the design for the main components to close the loop.

Each of these components can be developed as independent modules, or as a single unit containing all the logic. The details and final design of these components depend on the most convenient approach chosen during implementation.

Design

Tests Processor

This component will take care of processing the test results as they are received from LAVA.

LAVA test results are sent to Jenkins in a raw format, so the work at this level could involve cleaning the data or even converting the test results to a new format so they can be more easily processed by the rest of the tools.
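
As an illustration of the kind of normalisation involved, the sketch below flattens a LAVA callback payload into a simple test-to-result mapping. The exact layout of the callback payload should be checked against the LAVA documentation; this sketch assumes a results mapping of suite names to YAML-encoded lists of test cases:

    # Sketch: flatten a LAVA callback payload into {"suite/test": "pass|fail"}.
    import yaml

    def normalise_lava_results(payload):
        normalised = {}
        for suite, raw_cases in payload.get("results", {}).items():
            # Each suite is assumed to arrive as a YAML-encoded list of test
            # cases with "name" and "result" fields.
            for case in yaml.safe_load(raw_cases):
                normalised["{}/{}".format(suite, case["name"])] = case["result"]
        return normalised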

Tests Analyzer

This will make sure that the test results data is in a consistent and convenient format to be used by the next module (task manager).

This can be a new tool or just be part of the test results processor, running in the same Jenkins phase for convenience.

Tasks Manager

This will receive the fully analyzed test results data, and ideally it shouldn't deal with any test data manipulation.

It will take care of comparing the status between test results and Phabricator tasks, deciding the next steps to take and managing those tasks accordingly.

Notifier

This can be considered an optional component, and it can involve sending further forms of notifications to different services, for example, sending messages to Mattermost channels or emails notifying about new critical bugs.

Implementation

As originally envisioned, each of the design components could be written using a scripting language, preferably one that already offers a good integration with our infrastructure.

The Python language is highly recommended, as it already offers plugins for all the related infrastructure, so it would require minimal effort to integrate a solution written in this language.

As a suggested environment, Jenkins could be used as the main place to execute and orchestrate each of the components. They could be executed using a different pipeline for each phase, or just a single pipeline executing all the functionality.

For example, once the LAVA results are fetched in Jenkins, a new pipeline phase receiving the test results can be started to execute the test processor and test analyzer, which in turn will send their output to a new pipeline phase that executes the task manager and later (if available) the notifier.

Diagram

This is a diagram explaining the different infrastructure processes involved in the proposed design to close the CI loop.

Security Improvement

The Jenkins webhook URL will be visible in the public LAVA test definitions, which might raise security concerns. For example, another process posting to the webhook before LAVA does would break the Jenkins job waiting for the test results.

After researching several options to solve this issue, one solution has been found, which consists of checking in Jenkins for a protected authorization header sent by LAVA when posting to the webhook.

This solution requires changes in both the Jenkins plugin and the LAVA code, and these need to be implemented as part of the solution for closing the CI loop.

Implementation

The final implementation of the solution proposed in this document will mainly involve developing tools that need to be executed in Jenkins and that will interact with the rest of the existing infrastructure services: LAVA, Phabricator and, optionally, Mattermost.

All tools and programs will be available from the project git repositories along with their respective documentation, including how to set them up and use them.

In addition to this, the final implementation will also include documentation about how to integrate, use and maintain this solution using the currently available infrastructure services, so other teams and projects can also make use of it.

Constraints or Limitations

  • Some errors might not be trivially detected for automated tests, since LAVA can fail in several ways; for example, infrastructure errors can sometimes be difficult to analyze and will still require manual intervention.

  • The webhook plugin blocks the Jenkins pipeline. This might be an issue in the long term, and it should remain an open point for further research in a later version of this document or during implementation.

  • This document deals with the existing infrastructure, so a proper data storage for test cases and test results has not been defined. Creating the weekly test reports will continue to require manual intervention.

  • The test definitions for public LAVA jobs are publicly visible. The Jenkins webhook URL will also be visible in these test definitions, which can be a security concern. A solution for this issue is proposed in the Security Improvement section.

  • Closing and verifying tasks will still require manual intervention due to the points explained in the Considerations section.

New CI Infrastructure and Workflow

The main change in the new infrastructure is that test results and test cases will be stored in SQUAD and Git respectively, and that there will be mechanisms in place to visualise test results and send notifications for test issues. The new infrastructure is defined in the test data storage document.

Manual tests are also processed by the new infrastructure, so the new workflow covers closing the CI loop for manual tests as well.

Components and Workflow

A new web service can be set up to receive the callback triggered by LAVA at a specific URL in order to fetch the automated test results, instead of using the Jenkins webhook plugin. This covers the case where the Jenkins webhook turns out not to be a suitable solution during implementation, either for the current CI loop infrastructure or for the new one.

Therefore, the following steps use the term Tests Processor System to refer to the infrastructure in charge of receiving and processing these test results, which can be set up either in Jenkins or as a new infrastructure service.
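
As a rough illustration of this alternative, the sketch below implements a minimal standalone receiver for the LAVA callback using Flask. The endpoint, the authorization header handling and the process_results entry point are assumptions for illustration; the token check relates to the Security Improvement section above:

    # Sketch: standalone receiver for the LAVA notification callback,
    # rejecting requests that lack the expected authorization token.
    from flask import Flask, request, abort

    app = Flask(__name__)
    EXPECTED_TOKEN = "shared-secret-token"  # placeholder, kept out of public job definitions

    def process_results(payload):
        # Placeholder for the next phases: conversion to the SQUAD format,
        # submission, task management and notifications.
        print("Received results for LAVA job", payload.get("id"))

    @app.route("/lava/callback", methods=["POST"])
    def lava_callback():
        if request.headers.get("Authorization") != EXPECTED_TOKEN:
            abort(403)
        process_results(request.get_json(force=True))
        return "OK", 200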

The main components for the new infrastructure can be broadly split into the following phases: automated tests, manual tests, tasks management, and reporting and visualization.

Automated Tests

Workflow for automated tests:

  • Jenkins builds images and triggers LAVA jobs.
  • LAVA executes the automated test jobs and the results are saved in its database.
  • LAVA triggers a user notification callback, attaching test job information and results, to send to the tests processor system.
  • The system exposes an HTTP URL and waits for the LAVA callback in order to receive the test results.
  • Test results are received by the tests processor system.
  • Once the test results are received, they are processed with the tool to convert the test data into the SQUAD format.
  • After the data is in the correct format, it is sent to SQUAD using the HTTP API (a sketch of this submission step follows the list).
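
A minimal sketch of this submission step, assuming SQUAD's HTTP submit API (POST /api/submit/:group/:project/:build/:environment) with an Auth-Token header; the instance URL, group and project names, build identifier and token are placeholders:

    # Sketch: submit normalised test results to SQUAD over its HTTP API.
    import json
    import requests

    SQUAD_URL = "https://squad.example.com/api/submit/apertis/automated-tests"
    SQUAD_TOKEN = "squad-token"  # placeholder authentication token

    def submit_to_squad(build_version, environment, tests):
        # "tests" is a mapping like {"sanity-check/boot": "pass", ...}.
        response = requests.post(
            "{}/{}/{}".format(SQUAD_URL, build_version, environment),
            headers={"Auth-Token": SQUAD_TOKEN},
            data={"tests": json.dumps(tests)},
        )
        response.raise_for_status()

    submit_to_squad("20190401.0", "amd64", {"sanity-check/boot": "pass"})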

Manual Tests

Test results will be entered manually by the tester using a new application, referred to in this workflow as the Test Submitter Application.

This application will prompt the tester to enter the results of each manual test, and will send the data to the SQUAD backend, as explained in the test data storage document. A sketch of such a tool is given after the workflow below.

The following workflow includes the processing of manual test results into the CI loop:

  • The tester manually executes the test cases.
  • The tester enters the test results into the test submitter application.
  • The application sends the test data to the tests processor system using a reliable network protocol.
  • Test results are received by the tests processor system infrastructure.
  • Once the test results are received, they are processed with the tools to convert the test data into the SQUAD format.
  • After the data is in the correct format, it is sent to SQUAD using the HTTP API.
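
As a rough sketch of the Test Submitter Application as a command line tool: it renders a YAML test case, prompts for the result and sends it to the tests processor system over HTTP. The test case fields, the endpoint and the payload format are illustrative assumptions:

    # Sketch: prompt the tester for the result of a YAML test case and send
    # it to the tests processor system (endpoint and fields are placeholders).
    import requests
    import yaml

    PROCESSOR_URL = "https://ci.example.com/manual-results"

    def submit_manual_result(testcase_file, image_version):
        with open(testcase_file) as f:
            testcase = yaml.safe_load(f)
        print(testcase["name"])
        for step in testcase.get("steps", []):
            print(" -", step)
        result = ""
        while result not in ("pass", "fail", "skip"):
            result = input("Result [pass/fail/skip]: ").strip().lower()
        payload = {"test": testcase["name"], "result": result, "image": image_version}
        requests.post(PROCESSOR_URL, json=payload, timeout=30).raise_for_status()

    submit_manual_result("testcases/sanity-check.yaml", "20190401.0")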

Tasks Management

This phase deals with processing the test results data in order to file and manage Phabricator tasks and send notifications.

  • Once all test results are stored in the SQUAD backend, they might still need to be processed by other phases in the tests processor system, and then sent to a new phase to manage Phabricator tasks.
  • The new Phabricator phase uses the test data to file new tasks following the logic explained in the Tasks Management section.
  • The same phase or a new one could send result notifications to Mattermost or via email.

Reporting and Visualization

A new web application dashboard will be used to view test results and generate reports and graphical statistics.

This web application will fetch results from the SQUAD backend and will process them to generate the relevant statistics and graphics.

The weekly test report will be generated either periodically or at any time as needed using this web application dashboard.

More details can be found in the reporting and visualization document.

General Workflow Overview

This section gives an overview of the complete workflow in the following steps:

  • Automated tests and manual tests are executed in different environments.

  • Automated tests are executed in LAVA, and the results are sent to the HTTP URL opened by the Tests Processor System to receive the LAVA callback carrying the test results.

  • Manual tests are executed by the tester. The tester uses the Test Submitter App to collect test results and send them to the Tests Processor System using a reliable network protocol for data transfer.

  • All test results are processed and converted to the SQUAD JSON format by the Test Processor and Analyzer.

  • Once test results are in the correct format, they are sent to the SQUAD backend using the SQUAD HTTP API.

  • Test results might still need to be processed by the Test Processor and Analyzer in order to be sent to the new phases. Once the results are processed, they are passed to the Task Manager and Notification System phases to manage Phabricator tasks and to send email or Mattermost notifications respectively.

  • From the SQUAD backend, the new Web Application Dashboard fetches test results periodically or as needed to generate test result views, graphical statistics, and reports.

The following diagram illustrates the above workflow:

New Infrastructure Migration Steps

  • Set up a SQUAD instance. This can be done using a Docker image, so the setup should be very straightforward and convenient to replicate downstream.

  • Extend the current Test Processor System to submit results to SQUAD. This basically consists of using the SQUAD URL API to submit the test data.

  • Convert the test cases from the wiki format to the strictly defined YAML format.

  • Write an application to render the YAML test cases, guide testers through them and provide them with a form to submit their results. This is the Test Submitter App, and it can be developed as either a web frontend or a command line tool.

  • Write the reporting web application, which fetches results from SQUAD and renders reports. This is the Web App Dashboard, and it will be developed using existing modules and frameworks in a convenient way, so that deployment and maintenance can be done in the same way as for other infrastructure services.

Maintenance Impact

The new components required for the new infrastructure are the Test Submitter, the Web Application Dashboard and SQUAD, along with some changes to the Test Processor System so that it can receive the manual test results and send test data to SQUAD.

SQUAD is an upstream dashboard that can be deployed using Docker, so it can be conveniently used by other projects and its maintenance effort won't be greater than that of other infrastructure services.

The test submitter and the web application dashboard will be developed reusing existing modules and frameworks for each of their functionalities; they mainly need to use already well-defined APIs to interact with the rest of the services, and they will be designed in such a way that they can be conveniently deployed (for example, using Docker). They are not expected to be large applications, so their maintenance should be comparable to that of other tools in the project.

The test processor is a system tool developed in a modular way, so each component can reuse existing modules or libraries to implement the required functionality, for example, an existing HTTP module to access the SQUAD URL API. It therefore won't require a big maintenance effort, which in practice will be the same as for other infrastructure tools in the project.

Converting the test cases to the new YAML format can be done manually, and a small tool can be used to assist with the format migration (for example, to sanitize the format). This should be a one-time task, so no further maintenance is involved.

Links

LAVA Notification Callback:

  • https://lava.collabora.co.uk/static/docs/v2/user-notifications.html#notification-callback

Jenkins Webhook Plugin:

  • https://wiki.jenkins.io/display/JENKINS/Webhook+Step+Plugin

Phabricator API:

  • https://phabricator.apertis.org/conduit

Mattermost Jenkins Plugin:

  • https://wiki.jenkins.io/display/JENKINS/Mattermost+Plugin
