In the previous post I talked about how we test Materialize. This time I’ll describe how I significantly sped up our Continuous Integration (CI) Test pipeline in July, especially for pull requests that require a build and full test run. The goal is to make developers more productive by reducing the time waiting for CI to complete.

We have always kept CI runtime in mind, but it still crept up slowly over the years: new tests were added, the codebase grew, and hundreds of minor cuts added up.

This graph shows the CI runtimes for PRs requiring a build and tests. It is still missing my latest changes, since some of them are not merged, and not every PR has been rebased:

Can I make CI 10x faster in July?

The latest state from July 31 is a test run with a minimal recompilation, finishing in 7 minutes, about 7x as fast as this same run would have been on July 1:

The same PR finishes in less than 6 minutes without the build:

In practice build time can vary between 1-9 minutes, so we should now be able to finish a full CI run in 15 minutes at worst. We do have slower tests but those are tucked away in our Nightly (mostly < 2 hours) and Release Qualification (1 day) pipelines. I’ll go through some of the reasons our CI was slow, and what I did to speed it up.

Pipeline creation

| Step | Before | After | Relative Change |
| --- | --- | --- | --- |
| mkpipeline.sh | 31s | 0s | -100% |
| mkpipeline.py | 2m 47s | 21s | -87% |

There used to be two mkpipeline scripts: the first checked whether we needed to bootstrap our ci-builder Docker images, and the second generated the Buildkite pipeline from our template, based on whether a build is required, which tests are relevant to the change, and so on. Since bootstrapping was usually unnecessary, I added logic to fuse the two scripts into one when possible, which saves the time of scheduling a second job on an agent.

API calls and external program calls were taking most of the time, and could easily be parallelized. Using the Docker Hub API to check if an image is already available is about 5 times faster than running docker manifest inspect. Caching a list of all known available images locally is even faster of course.
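As a rough sketch of the difference (the image name and tag below are placeholders), the Docker Hub tags API answers a single HTTPS request, while docker manifest inspect goes through the full registry protocol:

```bash
IMAGE=materialize/ci-builder
TAG=somehash

# Fast path: one request to the Docker Hub API; -f makes curl fail on a 404.
if curl -fsS -o /dev/null "https://hub.docker.com/v2/repositories/${IMAGE}/tags/${TAG}"; then
  echo "image already available, no bootstrap needed"
fi

# Slow path for comparison: fetches the full manifest from the registry.
docker manifest inspect "${IMAGE}:${TAG}" >/dev/null
```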

To make sure we have good local caches we now keep an agent around for mkpipeline.

Builds

| Step | Before | After | Relative Change |
| --- | --- | --- | --- |
| Build x86-64 | 23m 14s | 1m 27s | -94% |
| Build aarch64 | 23m 29s | 1m 20s | -94% |

Materialize is written in Rust, and compilation is generally slow. Our baseline was using Bazel with its remote caching, which is able to build Materialize in 23 minutes in CI.

For regular test runs we now disable LTO since it adds about 20 minutes to incremental build times, while only making Materialize about 10% faster at runtime. For our actual releases we still use LTO.

Unfortunately Bazel doesn’t work well with Cargo’s incremental compilation, so we switched these specific builds back to cargo as well as to a larger agent. We currently use this Cargo build profile:

```toml
[profile.optimized]
inherits = "release"
lto = "off"
debug = 1
incremental = true
```

Similarly to mkpipeline, keeping an agent with warm caches around helps significantly here.

Most of our CI uses Docker images. Building these images and pushing them to Docker Hub also used to take 5 minutes. By parallelizing and fusing the build and push steps into a single docker buildx build --push, we now take about 2 minutes in the worst case of having to push all images.
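A minimal sketch of such a fused step (the image name and the ${BUILD_HASH} tag are placeholders):

```bash
# buildx uploads the image as part of the build instead of a separate `docker push`.
docker buildx build \
  --platform linux/amd64 \
  --tag materialize/environmentd:"${BUILD_HASH}" \
  --push \
  .
```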

Since we know what CPUs our CI runs on, we can optimize the binaries further, for example with -Ctarget-cpu=x86-64-v3 -Ctarget-feature=+aes,+pclmulqdq, which allows the Rust compiler to target Intel Haswell, AMD Ryzen, or newer CPUs. This helps counteract the performance lost by disabling LTO.
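One way to wire those flags in is through RUSTFLAGS (a sketch; our build tooling may pass them differently):

```bash
# RUSTFLAGS applies to every crate that cargo compiles in this invocation.
export RUSTFLAGS="-Ctarget-cpu=x86-64-v3 -Ctarget-feature=+aes,+pclmulqdq"
cargo build --profile optimized
```

Since changed RUSTFLAGS invalidate Cargo’s caches, the flags need to stay identical across runs for the warm-cache agents to keep paying off.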

We already had logic to calculate a hash of all files relevant for a build, so that we don’t have to rebuild on each test run, even if some test-only files have changed.
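Conceptually the fingerprint is just a content hash over that file list (the selection below is a made-up stand-in for the real logic), and the result can double as the Docker image tag:

```bash
# Hash the contents of all build-relevant files into a single fingerprint; if an
# image tagged with it already exists, the build step is skipped entirely.
BUILD_HASH=$(git ls-files -- src Cargo.toml Cargo.lock \
  | sort \
  | xargs sha256sum \
  | sha256sum \
  | awk '{print $1}')
```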

An option for the future is to not use Docker Hub, but upload the executables to an object store we control ourselves. Only about half of the Docker image size is the actual executable, the rest changes at most once a week when we upgrade our image dependencies. Since our testing design mostly depends on Docker images, we’d have to finish building them locally on the test runner with the executable though, which adds some more overhead. It is not clear if we’d save time doing that.

Lints & cargo test

| Step | Before | After | Relative Change |
| --- | --- | --- | --- |
| Cargo test | 18m 33s | 2m 40s / 2m 1s | -86% |
| Merge skew cargo check | 7m 16s | 7s | -98% |
| Lint and rustfmt | 5m 32s | 39s | -88% |
| Clippy and doctests | 17m 24s | 8s / 1m 3s | -94% |

As with all the other steps so far, keeping dedicated agents around is important so that Cargo caches stay warm in subsequent runs!

One issue here was that cargo relies exclusively on file modification times to determine whether a file has changed and needs to be recompiled. This required care, since we had a script to clean up the git repository and restore ownership of files that might have been changed by Docker containers running internally as root or another user. Changing ownership counts as modifying the file, so we had to restrict the cleanup to only those files that our tests actually write to as another user.
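The narrowed-down cleanup is conceptually a single targeted chown instead of a repository-wide one (the paths here are hypothetical):

```bash
# Reset ownership only under directories that containers write to as another user,
# leaving the sources and the cargo target directory untouched so that their
# timestamps stay trustworthy for incremental compilation.
sudo chown -R "$(id -u):$(id -g)" test/scratch mzdata
```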

As for cargo test, we were already using nextest to speed up our unit tests; I made sure to also use the optimized Cargo build profile instead of the default, unoptimized dev builds. Some individual tests were iterating over dozens of files, so I split them up further so that they can be parallelized better. Most of the tests don’t benefit from the regular builds, since we are not building and uploading the test executables to Docker Hub. An exception is our Cargo tests that make use of the clusterd executable: these tests now download the clusterd image when it’s available instead of building it themselves.
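With nextest that is a single flag (assuming a reasonably recent cargo-nextest):

```bash
# Build the test binaries with the optimized profile instead of the default dev profile.
cargo nextest run --cargo-profile optimized
```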

I parallelized the Cargo test runs across two agents. Instead of using nextest’s own --partition=count:{partition}/{total}, we switched to determining which packages run on which agent via --package=..., which also saves some compile time.
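Roughly, the change looks like this (the package split shown here is hypothetical):

```bash
# Before: nextest shards the tests, but both agents still compile every test binary.
cargo nextest run --partition count:1/2   # agent 1
cargo nextest run --partition count:2/2   # agent 2

# After: each agent compiles and runs only its own packages (and their dependencies).
cargo nextest run --package mz-sql --package mz-adapter       # agent 1
cargo nextest run --package mz-storage --package mz-compute   # agent 2
```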

SQL Logic Tests

| Step | Before | After | Relative Change |
| --- | --- | --- | --- |
| SQL Logic Tests | 6x 21m 50s | 4x 3m 53s | -82% |

We have a huge number of SLT files to run through. The main realization here was that our sqllogictest executable mostly runs single-threaded, so we can parallelize it on each CI agent by sharding across all files and running one sqllogictest executable per available CPU core. This required making the prefix for our metadata store configurable, so that multiple SLT executables can share a single metadata store.
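A minimal sketch of that sharding, assuming the .slt files live under test/sqllogictest and a hypothetical --prefix flag for the metadata-store prefix:

```bash
# Run one sqllogictest process per CPU core, each on its own slice of the files and
# with its own prefix, so that all shards can share a single metadata store.
CORES=$(nproc)
mapfile -t FILES < <(find test/sqllogictest -name '*.slt' | sort)
for i in $(seq 0 $((CORES - 1))); do
  printf '%s\n' "${FILES[@]}" \
    | awk -v n="$CORES" -v i="$i" 'NR % n == i' \
    | xargs sqllogictest --prefix "shard-$i" &
done
wait
```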

Other Tests

| Step | Before | After | Relative Change |
| --- | --- | --- | --- |
| Testdrive | 8x 20m 4s | 20x 4m 57s | -75% |
| Cluster tests | 4x 22m 45s | 16x 4m 24s | -81% |
| SSH connection tests | 1x 18m 59s | 3x 3m 18s | -83% |
| Platform checks | 6x 16m 50s | 16x 4m 19s | -74% |
| Source/Sink error reporting | 2x 23m 32s | 3x 3m 45s | -84% |

Every test was slow for its own reason; execution time for most tests hovered at 15-25 minutes, and now all of them finish in under 5 minutes. Some highlights:

Hetzner Agent Provisioning

Most of our CI runs on Hetzner with a custom-built autoscaler. It now detects which locations have which machines available, to avoid wasting API quota trying to provision agents that won’t come up anyway. After 20 minutes of failing to provision agents we fall back to AWS.

A major step in speeding up the tests was to provision their agents while the build is still running, so that they can start preparing and downloading the images that are already available. This reduces our preparation time from 4 minutes down to 1 minute on average.

Installing Docker itself on the agents took more than a minute through Fedora’s package manager, while downloading the executables directly finishes in a few seconds.
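As a sketch (the pinned version is an assumption), the static binaries are a single tarball away:

```bash
# Extract the static Docker binaries straight into the PATH instead of going through dnf.
curl -fsSL https://download.docker.com/linux/static/stable/x86_64/docker-27.1.1.tgz \
  | sudo tar -xz -C /usr/local/bin --strip-components=1
# dockerd still has to be started separately, e.g. via a small systemd unit.
```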

Eat my Data

We are now using libeatmydata across CI and tests. Many actions are filesystem intensive, and we don’t care at all what happens to the data when the agent crashes, since we will never schedule anything on it again. This especially affects our use of PostgreSQL as our metadata store, as well as persisting objects in the blob store.
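Two common ways to enable it, as a sketch (the test entry point is hypothetical, and the library path varies by distribution):

```bash
# Both variants turn fsync and related calls into no-ops for that process tree.
eatmydata ./run-tests.sh
LD_PRELOAD=libeatmydata.so ./run-tests.sh
```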

An easy way to check if a program is correctly using libeatmydata is to grep for the library in /proc/.../maps. For Go applications libeatmydata won’t work because they don’t dynamically link to the C standard library by default. An alternative is running on a tmpfs in memory, or modifying the application code manually to not execute fsync and related syscalls in testing.
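For example (the PID is a placeholder):

```bash
# If the preload worked, the library shows up in the process's memory mappings.
grep eatmydata /proc/12345/maps
```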

You can easily try out the effect when running DDL queries against Materialize:

```bash
docker run --env MZ_EAT_MY_DATA=1 -p 127.0.0.1:6875:6875 materialize/materialized:latest
psql postgres://materialize@127.0.0.1:6875
materialize=# \timing
materialize=# CREATE TABLE t (x int);
materialize=# DROP TABLE t;
```

The effect --env MZ_EAT_MY_DATA=1 has on my system is stark:

```
CREATE TABLE
Time: 111.492 ms -> Time: 8.773 ms (-92%)
DROP TABLE
Time: 133.021 ms -> Time: 6.504 ms (-95%)
```

Docker host networking

I expected a large impact from switching from Docker’s bridge networking to host networking, but it didn’t seem to be worth it for most tests; most of our tests do not appear to be network-bound at the moment. Using host networking also causes a bunch of confusion in tests that have many containers running at once, with a risk of port conflicts causing hard-to-debug CI failures. So I opted not to submit the change; after all, keeping CI sane and stable comes first.

Takeaways

Materialize has many features and interacts with many systems:

  1. Applications interact with Materialize using the Postgres protocol
  2. Users additionally run queries using HTTP, WebSockets, and through the MCP server*
  3. Materialize itself is a distributed system with multiple clusters on separate nodes
  4. Two environmentd processes can run at once during a zero-downtime upgrade*
  5. Materialize communicates with PostgreSQL* or CockroachDB and S3 or Azure Blob Storage* services for its internal catalog and storage needs
  6. Data is continuously ingested from a Kafka broker (plus a schema registry), Postgres, MySQL, SQL Server*, Webhooks and Fivetran*
  7. Data is continuously written to a Kafka broker and exported to S3

The stars (*) mark the newly supported systems since my previous post about a year ago. This shows why we have to keep growing our testing efforts and at the same time keep CI runtime low. Since there are so many separate systems at play, using Docker Compose to orchestrate them in testing is a huge benefit. The main takeaways for me are:

  • Keep agents up and their caches warm for easily cacheable tasks, make sure no tool messes with modification times
  • Cargo incremental compilation and disabled LTO are key for fast Rust builds
  • eatmydata or tmpfs for tests that do a lot of fsync-heavy filesystem work (databases, object storage) where the data is disposable
  • Start work as early as possible; for us this meant scheduling agents while the build is still running, allowing them to git clone and docker pull as much as possible in advance
  • Follow the cycle of Measure → Optimize/parallelize → Measure until fast enough
  • Set up monitoring for CI runtimes to catch regressions in the future
