
Continuous Integration in NiFi

Let's dive into what can be done for Continuous Integration when it comes to NiFi. If you skipped the introduction, by Continuous Integration we are talking about:

  • Automated Testing: Developers frequently commit the changes made on their flows into a shared repository. Each commit may trigger an automated build and testing process to catch issues early.
  • Early Detection of Bugs: Since flow changes are integrated continuously, errors can be identified and fixed quickly, reducing the risk of bugs building up in long development cycles.
  • Collaboration: CI encourages collaboration by ensuring that new changes integrate seamlessly with the existing codebase.

As we discussed before, in the NiFi world, a given use case may be a combination of three things:

  • The JSON flow definition that represents the NiFi Flow for the use case
  • (optional) One or many custom components packaged and deployed as NAR files
  • (optional) A set of parameters and/or configuration assets that are required for running the flow

In the rest of this documentation, we will cover Automated Testing and Early Detection of Bugs as a single topic. Here is a table of what is covered in this documentation as of today:

|                   | JSON Flow Definition | Custom Components | Parameters/Assets |
| ----------------- | -------------------- | ----------------- | ----------------- |
| Collaboration     | X                    | X                 | X                 |
| Automated Testing |                      |                   |                   |
info

This page will evolve as Datavolo builds and releases new features for NiFi.

Flow Definitions - Collaboration

Registry Client for Flow Versioning

At Datavolo, collaboration on Flow Definitions is done through registry clients that connect directly to code repositories. We currently provide two options:

  • GitHub Registry Client
  • GitLab Registry Client

The idea of a Registry Client is to connect NiFi directly to a repository and use the repository as a way to store and version flow definitions. A Registry Client can be added and configured by going into NiFi Global Menu > Controller Settings > Registry Clients.

caution

Note that a flow definition is versioned as a Process Group, and Datavolo discourages nested versioning of flows (i.e. Process Group A contains Process Group B and both Process Groups A and B are individually versioned in the repository).

A demo is available on our YouTube channel where the GitHub Registry Client is used to demonstrate the flow versioning capabilities of NiFi.

tip

In a Registry, there are some key concepts:

  • bucket: a way to logically group flow definitions that serve the same purpose, or to separate flows between different teams.
  • flow: represents a flow definition that implements a given use case. A flow can have many versions. Depending on the Registry Client implementation, a version can be a sequence number (1, 2, ...) or a string (a commit ID, for example).
  • branch: some Registry Clients support the concept of branching coming from git. See below for more details about branching.
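These concepts can be sketched as a simple data model. This is purely illustrative; the class and field names below are hypothetical and do not correspond to the actual Registry Client API:

```python
from dataclasses import dataclass, field

# Illustrative model of the Registry concepts above (bucket, flow, version).
# These classes are hypothetical, NOT the actual Registry Client API.

@dataclass
class FlowVersion:
    # Depending on the Registry Client, this may be a sequence ("1", "2", ...)
    # or an opaque string such as a git commit ID.
    version: str
    comments: str = ""

@dataclass
class Flow:
    # A flow definition implementing a given use case; it can have many versions.
    name: str
    versions: list = field(default_factory=list)

@dataclass
class Bucket:
    # Logical grouping of flows (per purpose, per team, ...).
    name: str
    flows: list = field(default_factory=list)

# A git-backed Registry Client may additionally scope buckets and flows by branch.
ingest = Flow(name="ingest-orders", versions=[FlowVersion("a1b2c3d")])
team_a = Bucket(name="team-a", flows=[ingest])
```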

Branching

Let's consider a relatively simple approach for branching:

  • There is a main branch for what is deployed in the production environment
  • There is a dev branch for the work being done in the development environment
  • Additional feature-xxx branches are created when working on a new feature of a flow

A fairly common strategy is to create a feature branch from the dev branch when a new feature needs to be developed in an existing flow. This allows for multiple individuals to work on different features for the same flow, at the same time.

To do that, the following steps would be executed:

  • In the code repository used by the configured Registry Client, the user creates the feature branch.
  • Back in the NiFi UI, the user imports the flow from the feature branch as a new process group.
  • The user can then enter the process group, start working on the changes, and commit whenever required.

Once the final changes for the feature are done and the last commit landed on the feature branch, it would be time to go into the code repository and open a pull request from the feature branch against the dev branch.
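The feature-branch steps above can be sketched with plain git commands. This is illustrative only: the repository is a local stand-in, the branch and commit names are hypothetical, and in practice the branch creation would typically happen in the GitHub/GitLab UI while the commits come from NiFi's Registry Client.

```shell
# Local stand-in for the remote repository used by the Registry Client.
repo=$(mktemp -d)
cd "$repo"
git init -q -b main
git config user.email "dev@example.com"
git config user.name "Dev"
git commit -q --allow-empty -m "initial flow definition"

# A dev branch tracks the development environment.
git branch dev

# 1. Create the feature branch from dev (normally done in the GitHub/GitLab UI).
git checkout -q -b feature-new-parser dev

# 2./3. In NiFi, import the flow from feature-new-parser and commit changes;
#       each commit made through the Registry Client lands on this branch, e.g.:
git commit -q --allow-empty -m "Add parsing logic to ingest flow"

# 4. When done, open a pull request from feature-new-parser against dev.
git log --oneline -1
```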

note

Similar to traditional code repositories, a good branching strategy is key to make sure that, while new features can be implemented in the dev branch, it is also possible to implement quick bug fixes against the main branch that represents what is deployed in production. In that case, a specific bug-fix branch would be created from the main branch, the flow would be checked out in NiFi using that bug-fix branch, changes would be made, committed and then changes would be merged into main via a pull request.

A demo is available on our YouTube channel where we discuss the concept of branching with the GitHub Registry Client.

Pull Requests and reviewing changes

In NiFi, a flow definition is a JSON file, and comparing two JSON files is rarely straightforward. When opening a pull request, the author is expected to clearly describe the submitted changes, but the reviewers of the pull request also need to look at the differences between the two JSON files to accept or reject those changes.
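To see why comparing raw JSON is painful, even a naive recursive diff over two flow definitions produces low-level output that is hard to map back to what actually changed in the flow. The snippets below are simplified, hypothetical fragments of a flow definition, not the real schema:

```python
def json_diff(a, b, path=""):
    """Recursively list leaf-level differences between two JSON-like values."""
    diffs = []
    if isinstance(a, dict) and isinstance(b, dict):
        for key in sorted(set(a) | set(b)):
            if key not in a:
                diffs.append(f"added   {path}/{key} = {b[key]!r}")
            elif key not in b:
                diffs.append(f"removed {path}/{key}")
            else:
                diffs.extend(json_diff(a[key], b[key], f"{path}/{key}"))
    elif a != b:
        diffs.append(f"changed {path}: {a!r} -> {b!r}")
    return diffs

# Hypothetical, heavily simplified fragments of two versions of a flow definition.
flow_a = {"processors": {"GetFile": {"properties": {"Input Directory": "/in"}}}}
flow_b = {"processors": {"GetFile": {"properties": {"Input Directory": "/data/in",
                                                    "Recurse Subdirectories": "true"}}}}

for line in json_diff(flow_a, flow_b):
    print(line)
```

A reviewer would still have to translate paths like `/processors/GetFile/properties/...` into flow-level intent, which is exactly the gap the Flow Diff action described below fills.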

Datavolo Flow Diff Github Action

To help with reviewing changes, Datavolo provides a GitHub Action that will compare the two flow definitions and automatically add a comment to the pull request with a human readable description of the changes.

To configure this GitHub Action, create a file .github/workflows/flowdiff.yml in the repository used to version your NiFi flow definitions. Below is the content of the file:

```yaml
name: Datavolo Flow Diff on Pull Requests
on:
  pull_request:
    types: [opened, reopened, synchronize]

jobs:
  execute_flow_diff:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    name: Executing Flow Diff
    steps:
      # checking out the code of the pull request (merge commit - if the PR is mergeable)
      - name: Checkout PR code
        uses: actions/checkout@v4
        with:
          path: submitted-changes

      # getting the path of the flow definition that changed (only one expected for now)
      - id: files
        uses: Ana06/get-changed-files@v1.2

      # checking out the code without the change of the PR
      - name: Checkout original code
        uses: actions/checkout@v4
        with:
          fetch-depth: 2
          path: original-code
      - run: cd original-code && git checkout HEAD^

      # Running the diff
      - name: Datavolo Flow Diff
        uses: datavolo-io/datavolo-flow-diff@v0
        id: flowdiff
        with:
          flowA: 'original-code/${{ steps.files.outputs.all }}'
          flowB: 'submitted-changes/${{ steps.files.outputs.all }}'
```

Custom Components - Collaboration

By design, NiFi is extremely extensible and it is very easy to build your own components to suit your specific needs and use cases.

When it comes to collaborating on code, the usual solutions apply. The code of the components is pushed and versioned in git-based repositories. Unit tests are implemented, and a CI pipeline is deployed to execute them. When a new version of a component is ready, a tag is created, which triggers another pipeline that packages the component as a NAR (NiFi Archive). At that point, the NAR file can be published to an artefact repository (Maven, Nexus, etc.) or pushed directly to NiFi (see Continuous Deployment for more details).
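As a sketch, a tag-triggered pipeline that packages a component as a NAR could look like the following GitHub Actions workflow. This assumes a Maven project built with the nifi-nar-maven-plugin; the workflow name, tag pattern, Java version, and artifact name are all hypothetical choices, not a Datavolo-provided template:

```yaml
name: Build NAR on tag
on:
  push:
    tags: ['v*']

jobs:
  build-nar:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: '21'
      # Runs the unit tests and packages the component as a NAR
      - run: mvn -B verify
      # Publish the NAR as a build artifact (or push it to an artefact
      # repository / directly to NiFi instead)
      - uses: actions/upload-artifact@v4
        with:
          name: custom-component-nar
          path: '**/target/*.nar'
```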

Java (and other JVM based languages)

While Java is the preferred language for writing custom components, it is also possible to use languages such as Groovy, Scala, etc. Please refer to the developer guide to get started.

Python

With NiFi 2, it is possible to build custom processors using Python. Please look at our documentation and our resources in the Dev Center to know more about building custom components in Python.
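As a minimal sketch, a NiFi 2 Python processor extends FlowFileTransform from the nifiapi module and implements a transform() method. Since nifiapi is only available inside the NiFi runtime, the stand-in base classes and the demo FlowFile below are hypothetical simplifications so the sketch can run on its own; the processor name and logic are also invented for illustration:

```python
class FlowFileTransform:
    """Stand-in for nifiapi.flowfiletransform.FlowFileTransform (runtime-only)."""
    def __init__(self, **kwargs):
        pass

class FlowFileTransformResult:
    """Stand-in for nifiapi.flowfiletransform.FlowFileTransformResult."""
    def __init__(self, relationship, contents=None):
        self.relationship = relationship
        self.contents = contents

class UppercaseContent(FlowFileTransform):
    """Hypothetical processor that uppercases the incoming FlowFile content."""

    class ProcessorDetails:
        version = "0.0.1"
        description = "Uppercases the incoming FlowFile content."

    def transform(self, context, flowfile):
        # NiFi hands in the FlowFile; we read its content, transform it,
        # and route the result to the 'success' relationship.
        text = flowfile.getContentsAsBytes().decode("utf-8")
        return FlowFileTransformResult(relationship="success",
                                       contents=text.upper())

class _DemoFlowFile:
    """Minimal stand-in for the FlowFile object provided by NiFi."""
    def __init__(self, data):
        self._data = data
    def getContentsAsBytes(self):
        return self._data

result = UppercaseContent().transform(None, _DemoFlowFile(b"hello nifi"))
print(result.contents)
```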

Parameters & Assets - Collaboration

The best way to parameterise a flow definition is to leverage the concept of parameters in NiFi. Anything in the flow definition that may need to change based on the environment where the flow definition is deployed should be configured as a parameter.

Parameters are grouped into a Parameter Context, and a Process Group is assigned at most one Parameter Context. However, it is possible to define inheritance between Parameter Contexts so that the parameters of multiple Parameter Contexts can be referenced by the components of a Process Group. More information about parameters can be found here, as well as about parameter context inheritance here.
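Inheritance effectively gives a Process Group a search order over several Parameter Contexts: a context's own parameters win, then its inherited contexts are consulted in order. A minimal sketch of that resolution logic (illustrative only, not NiFi's actual implementation; the context and parameter names are hypothetical):

```python
class ParameterContext:
    """Illustrative model of a Parameter Context with inheritance."""
    def __init__(self, name, parameters=None, inherited=None):
        self.name = name
        self.parameters = dict(parameters or {})
        # Ordered list of inherited ParameterContext objects.
        self.inherited = list(inherited or [])

    def resolve(self, param_name):
        # Directly defined parameters take precedence ...
        if param_name in self.parameters:
            return self.parameters[param_name]
        # ... then inherited contexts are searched in order.
        for ctx in self.inherited:
            try:
                return ctx.resolve(param_name)
            except KeyError:
                pass
        raise KeyError(param_name)

common = ParameterContext("common", {"s3.region": "us-east-1"})
dev = ParameterContext("dev", {"s3.bucket": "dev-bucket"}, inherited=[common])

print(dev.resolve("s3.bucket"))  # defined directly on dev
print(dev.resolve("s3.region"))  # inherited from common
```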

Parameter Providers

Parameter Providers are a specific extension type in NiFi for which many implementations are available (see here). A Parameter Provider is a component that connects to an external system, fetches parameters from it, and creates one or many Parameter Contexts from the fetched parameters.

important

Parameter Providers are defined at controller level and, as such, are not part of a flow definition. The expectation is that Parameter Providers are configured and instantiated by NiFi admins in every environment prior to deploying flows. The flows can then reference the Parameter Contexts and associated parameters coming from the Parameter Providers.

Assets

When a parameter is used in a component's property that expects a file, it is possible to upload those assets directly into NiFi and reference them from the flow through parameters. A typical example is the use of a JDBC driver JAR file in the DBCP Connection Pool controller service.
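For example, after uploading a JDBC driver as an asset and binding it to a parameter, the DBCP Connection Pool properties would reference it with NiFi's `#{...}` parameter syntax. The parameter names below are hypothetical, and the property labels are a sketch of the service's configuration rather than an exhaustive list:

```
# Hypothetical DBCP Connection Pool configuration using asset-backed parameters
Database Driver Location(s):  #{jdbc-driver-jar}     <- asset-backed parameter
Database Connection URL:      #{db-connection-url}
Database Driver Class Name:   org.postgresql.Driver
```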

You can find a YouTube video on how assets are managed at Datavolo, and check out this tutorial in our Dev Center.