Introducing Workload Capture & Replay

When customers hit issues in production, it can be a real effort to reproduce them locally, especially when external systems are involved. Reproducing an issue is useful not just for finding the root cause, but also for verifying the fix and adding a regression test. The newly introduced workload capture & replay tooling records a Materialize instance's state as well as its recent queries and ingestion rates, then replays them in a Docker Compose environment with synthetic data. In this blog post I'll show how it works and discuss some of the challenges and future work.

[Architecture diagram]

Capturing

In this example we are running the Materialize Emulator locally (see the related blog post):

```bash
$ docker pull materialize/materialized:latest
$ docker run -it --name materialized -p 127.0.0.1:6874-6878:6874-6878 -e \
  'MZ_SYSTEM_PARAMETER_DEFAULT=enable_statement_lifecycle_logging=true;statement_logging_default_sample_rate=1;statement_logging_max_sample_rate=1' \
  materialize/materialized:latest
$ psql postgres://materialize@127.0.0.1:6875/materialize
materialize=> CREATE TABLE some_table (full_name text, age_in_years int);
CREATE TABLE
materialize=> INSERT INTO some_table VALUES ('Matty', 100);
INSERT 0 1
materialize=> CREATE MATERIALIZED VIEW some_mv AS SELECT full_name FROM some_table;
CREATE MATERIALIZED VIEW
materialize=> CREATE VIEW some_view AS SELECT sum(age_in_years) AS total_years FROM some_table;
CREATE VIEW
materialize=> CREATE DEFAULT INDEX ON some_view;
CREATE INDEX
materialize=> SELECT * FROM some_view;
 total_years
-------------
         100
(1 row)
materialize=> SELECT *, 'some literal' FROM some_mv;
 full_name |   ?column?
-----------+--------------
 Matty     | some literal
(1 row)
```

Capturing a workload is simple: check out the Materialize repository and run bin/mz-workload-capture against the system user's port 6877:

```bash
$ bin/mz-workload-capture \
  postgres://mz_system:materialize@127.0.0.1:6877/materialize
Fetching clusters                                     [   0.00s]
Fetching databases                                    [   0.00s]
Fetching schemas                                      [   0.00s]
Fetching data types                                   [   0.00s]
Fetching connections                                  [   0.00s]
Fetching sources                                      [   0.01s]
Fetching subsources                                   [   0.01s]
Fetching tables                                       [   0.04s]
Fetching views                                        [   0.01s]
Fetching materialized views                           [   0.01s]
Fetching sinks                                        [   0.00s]
Fetching indexes                                      [   0.01s]
Fetching queries                                      [   0.60s]
Fetching source/subsource/table statistics            [   0.00s]
Writing workload to workload_2026-02-06T12-08-57.yml  [   0.00s]
```

Since our Materialize instance has so few objects, the state is captured quickly. By default the last 360 seconds of queries are captured, but you can also specify, for example, --time 3600 to capture a full hour. The output is a YAML workload file:

```yaml
clusters:
  quickstart:
    create_sql: CREATE CLUSTER "quickstart" (INTROSPECTION DEBUGGING = false, INTROSPECTION
      INTERVAL = INTERVAL '00:00:01', MANAGED = true, REPLICATION FACTOR = 1, SIZE
      = '800cc', SCHEDULE = MANUAL)
    managed: true
databases:
  materialize:
    public:
      connections: {}
      indexes:
        some_view_primary_idx:
          create_sql: 'CREATE INDEX some_view_primary_idx

            IN CLUSTER quickstart

            ON materialize.public.some_view (total_years);'
      materialized_views:
        some_mv:
          columns:
          - default: null
            name: full_name
            nullable: true
            type: text
          create_sql: "CREATE MATERIALIZED VIEW materialize.public.some_mv\n    IN
            CLUSTER quickstart\n    WITH (REFRESH = ON COMMIT)\n    AS SELECT full_name
            FROM materialize.public.some_table;"
      sinks: {}
      sources: {}
      tables:
        some_table:
          columns:
          - default: 'NULL'
            name: full_name
            nullable: true
            type: text
          - default: 'NULL'
            name: age_in_years
            nullable: true
            type: integer
          create_sql: CREATE TABLE materialize.public.some_table (full_name pg_catalog.text,
            age_in_years pg_catalog.int4);
          id: u1
          rows: 1
      types: {}
      views:
        some_view:
          columns:
          - default: null
            name: total_years
            nullable: true
            type: bigint
          create_sql: "CREATE VIEW\n    materialize.public.some_view\n    AS SELECT
            pg_catalog.sum(age_in_years) AS total_years FROM materialize.public.some_table;"
mz_workload_version: 1.0.0
queries:
- began_at: 2026-02-06 12:08:50.038000+00:00
  cluster: quickstart
  database: materialize
  duration: 0.013
  finished_status: success
  params: []
  result_size: 37
  search_path:
  - public
  sql: SELECT *, 'some literal' FROM some_mv
  statement_type: select
  transaction_isolation: strict serializable
```

For the sake of brevity I have removed all but the last query. This was of course a pretty simple setup, but it shows the most basic functionality of mz-workload-capture: the definitions and metadata of objects are extracted, as well as the queries run during the specified time window. For tables we have statistics about how many rows there are, but not their actual contents.

The capture tool leverages the same introspection views that the Materialize Console uses to show source/sink statistics and the Query History.
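For illustration, here is roughly how recent queries could be fetched from those views in a short Python script. The view and column names below (mz_internal.mz_recent_activity_log with began_at, finished_status, and sql) follow the current documentation but are an assumption here and may differ between Materialize versions:

```python
import psycopg2  # assumes the psycopg2 PostgreSQL driver is installed

# mz_internal objects are only queryable as a system user, hence port 6877.
with psycopg2.connect(
    "postgres://mz_system:materialize@127.0.0.1:6877/materialize"
) as conn:
    with conn.cursor() as cur:
        # Fetch the statements executed during the last 360 seconds.
        cur.execute("""
            SELECT began_at, finished_status, sql
            FROM mz_internal.mz_recent_activity_log
            WHERE began_at > now() - INTERVAL '360 seconds'
            ORDER BY began_at
        """)
        for began_at, status, sql in cur.fetchall():
            print(began_at, status, sql)
```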

What's been missing from this example are the things that actually make Materialize interesting: ingesting data from large PostgreSQL, MySQL, SQL Server, and Kafka sources, as well as through webhooks, and writing out data through Kafka sinks. But fear not, all of the above are supported by mz-workload-capture as well. This is how a PostgreSQL source looks, for example:

```yaml
sources:
  pg_cdc:
    bytes_second: 691.7790633608815
    bytes_total: 685433819
    children:
      qa_canary_environment.public_pg_cdc.pg_people:
        bytes_second: 498.3641873278237
        bytes_total: 493694651
        columns:
        - default: null
          name: id
          nullable: false
          type: integer
        [...]
        create_sql: "CREATE TABLE\n    qa_canary_environment.public_pg_cdc.pg_people\n
          \       (\n            id pg_catalog.int4 NOT NULL,\n            name
          pg_catalog.text,\n            incarnation pg_catalog.int4,\n            CONSTRAINT
          people_pkey PRIMARY KEY (id)\n        )\nFROM SOURCE qa_canary_environment.public_pg_cdc.pg_cdc
          (REFERENCE = postgres.public.people)\nWITH (\n    DETAILS = '0a7c0a7a0a0670656f706c6512067075626c696318b7d00a22130a026964101718ffffffffffffffffff01300122170a046e616d65101918ffffffffffffffffff0120013002221e0a0b696e6361726e6174696f6e101718ffffffffffffffffff01200130032a1608bfd00a120b70656f706c655f706b65791a01012001'\n);"
        database: qa_canary_environment
        id: u87088
        messages_second: 6.714325068870523
        messages_total: 6621734
        name: pg_people
        schema: public_pg_cdc
        type: table
      [...]
    columns:
    - default: null
      name: lsn
      nullable: true
      type: uint8
    create_sql: 'CREATE SOURCE qa_canary_environment.public_pg_cdc.pg_cdc

      IN CLUSTER qa_canary_environment_storage

      FROM POSTGRES CONNECTION qa_canary_environment.public.pg (PUBLICATION
      = ''mz_source'');'
    id: u87073
    messages_second: 10.743801652892563
    messages_total: 10616300
    type: postgres
```

As you can see, for sources we have statistics about the total number of messages ingested as well as the ingestion rate during the captured time period.

Replaying

Now we're getting to the most interesting part: actually replaying a captured workload file, here for one hour, with 1% of the initial data generated synthetically and the full rate of queries and ingestions during the continuous phase:

```bash
$ bin/mzcompose --find workload-replay run default \
  --runtime=3600 --verbose workload_ddnet.yml \
  --factor-initial-data=.01 --factor-queries=1 --factor-ingestions=1
```

Under the hood this sets up a local Docker Compose environment containing all the required services: always the Materialize emulator (materialized), and, depending on the sources/sinks in the workload file, Kafka, PostgreSQL, MySQL, and SQL Server. This means we are currently limited to workloads that fit on a single machine. Then we create all the specified objects: clusters, databases, schemas, types, connections, sources, tables, views, materialized views, sinks, and indexes.
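To illustrate the service selection, deriving the required containers from a workload file could look like the following sketch (the helper and the exact type strings for MySQL/SQL Server are assumptions, and sinks would be handled analogously):

```python
import yaml  # assumes PyYAML is installed

# Hypothetical mapping from source type to Docker Compose service name.
SERVICE_FOR_SOURCE_TYPE = {
    "postgres": "postgres",
    "mysql": "mysql",
    "sql-server": "sql-server",
    "kafka": "kafka",
}

def required_services(workload_path: str) -> set[str]:
    with open(workload_path) as f:
        workload = yaml.safe_load(f)
    services = {"materialized"}  # the Materialize emulator is always needed
    for database in workload.get("databases", {}).values():
        for schema in database.values():
            for source in schema.get("sources", {}).values():
                service = SERVICE_FOR_SOURCE_TYPE.get(source.get("type"))
                if service:
                    services.add(service)
    return services

print(required_services("workload_2026-02-06T12-08-57.yml"))
```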

All connections to external sources are automatically rewritten to target the instances running inside Docker Compose instead of the original systems. The replayer runs in total isolation from the outside world and sets up everything it needs itself.

As there can be dependencies between objects, in particular among views and materialized views, the order of creation matters. One solution would be to build a dependency graph and create objects in a topological order. Instead we chose to simply retry failed object creations after all other objects have been created, since a failed CREATE DDL statement is cheap.
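A minimal sketch of this retry strategy, assuming an autocommit connection and a list of CREATE statements (the helper name is hypothetical, not the actual replayer code):

```python
def create_objects(conn, create_sqls: list[str]) -> None:
    """Create objects, retrying failures until a round makes no progress.

    Assumes `conn` is in autocommit mode, so a failed DDL statement
    doesn't poison an enclosing transaction.
    """
    pending = list(create_sqls)
    while pending:
        failed = []
        for sql in pending:
            try:
                with conn.cursor() as cur:
                    cur.execute(sql)
            except Exception:
                # Likely depends on an object that doesn't exist yet;
                # retry in the next round.
                failed.append(sql)
        if len(failed) == len(pending):
            raise RuntimeError(f"could not create {len(failed)} objects")
        pending = failed
```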

After everything is initialized, workload-replay generates synthetic data in the external sources as specified for each source/subsource/table in Materialize, and also fills up Materialize-native tables and webhook sources. The amount of data can be varied with --factor-initial-data, which defaults to 1.0, meaning we generate as many rows/messages as were recorded in the original Materialize instance. Before we can continue, we have to wait for Materialize to hydrate all of its objects.
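Waiting for hydration can be done by polling Materialize's introspection views; here is a minimal sketch, assuming the mz_internal.mz_hydration_statuses view with a boolean hydrated column (names per current docs, and may change between versions):

```python
import time
import psycopg2

def wait_for_hydration(dsn: str, timeout_secs: float = 3600.0) -> None:
    """Poll until no object reports as unhydrated, or time out."""
    deadline = time.time() + timeout_secs
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            while time.time() < deadline:
                cur.execute(
                    "SELECT count(*) FROM mz_internal.mz_hydration_statuses"
                    " WHERE NOT hydrated"
                )
                if cur.fetchone()[0] == 0:
                    return
                time.sleep(1)
    raise TimeoutError("objects did not hydrate in time")
```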

Care was taken to make data generation fast, using COPY FROM STDIN instead of INSERT for PostgreSQL/Materialize, as well as asynchronous data production for Kafka and webhooks. In our CI we see about 20k rows/s for PostgreSQL sources, 10k rows/s for Kafka, and 3k rows/s for webhooks. The exact speed depends on the source definition and on which views, indexes, and materialized views depend on the ingested data, since by default we start hydrating them during the initial ingestion.
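As an illustration of the COPY-based loading, here is a sketch using psycopg2 (the actual replayer's code differs; the table and values are taken from the example above):

```python
import io
import psycopg2

def bulk_load(dsn: str, table: str, rows: list[tuple]) -> None:
    """Load rows via a single COPY FROM STDIN instead of many INSERTs."""
    buf = io.StringIO()
    for row in rows:
        # COPY's default text format is tab-separated; this naive encoding
        # assumes values contain no tabs, newlines, or backslashes.
        buf.write("\t".join(str(v) for v in row) + "\n")
    buf.seek(0)
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.copy_expert(f"COPY {table} FROM STDIN", buf)

bulk_load(
    "postgres://materialize@127.0.0.1:6875/materialize",
    "some_table",
    [("Matty", 100), ("Motty", 42)],
)
```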

The synthetic data itself is generated with a long-tail distribution, which is something that’s often seen in real data.
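For instance, a Zipf-like long-tail draw over a value vocabulary only needs the standard library (illustrative only; the real generator is more elaborate):

```python
import random

def long_tail_choices(values: list, n: int, s: float = 1.2, seed: int = 0) -> list:
    """Draw n values with Zipf-like weights: the first few values dominate,
    while the remaining ones form a long tail."""
    rng = random.Random(seed)
    weights = [1.0 / rank**s for rank in range(1, len(values) + 1)]
    return rng.choices(values, weights=weights, k=n)

print(long_tail_choices(["Matty", "Somebody", "Nobody", "Anybody"], n=10))
```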

Finally we reach the continuous phase, which in parallel replays data ingestions scaled by --factor-ingestions and queries scaled by --factor-queries. Failed queries, as well as ingestions and queries that ran too slowly, are logged at the end:

```text
Queries:
   Total: 403
  Failed: 0 (0%)
    Slow: 5 (1%)
Ingestions:
   Total: 46
  Failed: 0 (0%)
    Slow: 0 (0%)
```

Regression Tests & Benchmarks

In CI we maintain a collection of captured workloads and run them against the previous Materialize version as well as the current state. When a query produces new errors we can report them as a regression in the new Materialize version and fail the test:

[Screenshot: error reported as a regression]

Similarly, we can compare performance between Materialize versions, both CPU and memory, in the initial data phase as well as the continuous phase:

[Charts: Initial Data Phase CPU, Initial Data Phase Memory, Continuous Phase CPU, Continuous Phase Memory]

Worse performance is detected automatically and would cause the test in CI to fail:

```bash
$ bin/mzcompose --find workload-replay run benchmark
METRIC                   |     OLD      |     NEW      |  CHANGE   | THRESHOLD | REGRESSION?
--------------------------------------------------------------------------------------------
Object creation (s)      |       15.949 |       15.762 |     -1.2% |       20% |      no
Data ingestion time (s)  |      901.443 |      911.124 |     +1.1% |       20% |      no
Data ingestion CPU (sum) |   949763.388 |   946471.517 |     -0.3% |       20% |      no
Data ingestion Mem (sum) |    21960.639 |    22134.356 |     +0.8% |       20% |      no
CPU avg (%)              |      596.030 |      529.256 |    -11.2% |       20% |      no
Mem avg (%)              |       47.639 |       43.626 |     -8.4% |       20% |      no
Query max (ms)           |  1764795.699 |    11374.836 |    -99.4% |           |
Query min (ms)           |        0.430 |        0.349 |    -18.9% |           |
Query avg (ms)           |   258871.898 |      977.998 |    -99.6% |           |
Query p50 (ms)           |   164878.947 |      504.100 |    -99.7% |           |
Query p95 (ms)           |  1076744.176 |     3240.775 |    -99.7% |           |
Query p99 (ms)           |  1555137.942 |     5651.330 |    -99.6% |           |
Query std (ms)           |   332962.336 |     1167.276 |    -99.6% |           |
```

In this example a nice optimization caused query times to improve significantly for this workload.
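The regression check itself is conceptually a simple threshold comparison on the relative change; a sketch, using the numbers from the first table row:

```python
def is_regression(old: float, new: float, threshold_pct: float = 20.0) -> bool:
    """Flag a regression when the new value is worse by more than the threshold."""
    if old == 0:
        return False
    change_pct = (new - old) / old * 100.0
    return change_pct > threshold_pct

# Object creation time changed from 15.949s to 15.762s (-1.2%): no regression.
print(is_regression(old=15.949, new=15.762))  # -> False
```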

Care is taken to run the benchmark against both Materialize versions with the same seed, and to use a separate RNG for each thread. This ensures that the same random data is generated for data ingestions and the same queries are executed, regardless of thread scheduling.
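Conceptually, the per-thread seeding looks like this (a sketch, not the benchmark harness itself):

```python
import random
from concurrent.futures import ThreadPoolExecutor

BASE_SEED = 42  # shared between the runs against the old and new version

def worker(thread_id: int) -> list[int]:
    # Deriving a per-thread seed keeps the generated data deterministic
    # and independent of thread scheduling, so both versions replay
    # identical ingestions and queries.
    rng = random.Random(BASE_SEED * 1000 + thread_id)
    return [rng.randrange(1000) for _ in range(3)]

with ThreadPoolExecutor(max_workers=4) as pool:
    print(list(pool.map(worker, range(4))))
```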

Statistics

Workloads of production systems can be huge, so inspecting them manually can be daunting. Instead we can print some statistics:

```bash
$ bin/mzcompose --find workload-replay run stats
workload_prod_sandbox.yml
  size                   1.9 MiB
  clusters                     7
  databases                    7
  schemas                     21
  data types                   0
  tables                      34
  connections                 13
  sources                     11
    kafka                      2
    load-generator             5
    mysql                      1
    postgres                   2
    webhook                    1
  subsources                  23
  views                       12
  mat. views                  12
  indexes                     15
  sinks                        4
    kafka                      4
  rows               350,207,350
    /s                    255.98
  queries                  4,761
    span                   60min
    last              2026-01-25
```

Diffing

With a YAML diffing tool like dyff you can get reasonable results for workload files. This lets you tell the difference between two states of a Materialize instance, making it easier to figure out what changed and what caused the different behavior you might be seeing:

```bash
$ dyff between -b workload_2026-01-27T14-11-02.yml workload_2026-01-28T09-08-12.yml

databases.materialize.public.sources.record_race.bytes_second
  ± value change
    - 308.3648871766648
    + 151.4994481236203

databases.materialize.public.sources.record_race.bytes_total
  ± value change
    - 31323976680
    + 31340458793

databases.materialize.public.sources.record_race.messages_second
  ± value change
    - 0.6194276279581729
    + 0.304083885209713

databases.materialize.public.sources.record_race.messages_total
  ± value change
    - 63161896
    + 63199673

databases.materialize.public.sources.record_teamrace
  + two map entries added:
    bytes_second: 5.649834437086093
    messages_second: 0.018211920529801324
```

Anonymizing

When you ask someone to hand over a workload YAML file, they can of course inspect whether it contains any information they don't want to share, be it an identifier, a literal in a query, or a column default in a table.

We also have an initial, simple anonymizer, which currently works on a best-effort basis, as it doesn't (yet) properly parse the SQL queries and reconstruct them:

```bash
$ bin/mz-workload-anonymize workload_2026-02-03T13-11-03.yml
```

After the modification the workload looks like this, with the changed parts marked in green:

```yaml
clusters:
  cluster_0:
    create_sql: CREATE CLUSTER "cluster_0" (INTROSPECTION DEBUGGING = false, INTROSPECTION
      INTERVAL = INTERVAL '00:00:01', MANAGED = true, REPLICATION FACTOR = 1, SIZE
      = '800cc', SCHEDULE = MANUAL)
    managed: true
databases:
  materialize:
    public:
      connections: {}
      indexes:
        index_1:
          create_sql: 'CREATE INDEX index_1

            IN CLUSTER cluster_0

            ON materialize.public.view_1 (column_3);'
      materialized_views:
        mv_1:
          columns:
          - default: null
            name: column_4
            nullable: true
            type: text
          create_sql: "CREATE MATERIALIZED VIEW materialize.public.mv_1\n    IN CLUSTER
            cluster_0\n    WITH (REFRESH = ON COMMIT)\n    AS SELECT column_4 FROM
            materialize.public.table_1;"
      sinks: {}
      sources: {}
      tables:
        table_1:
          columns:
          - default: 'NULL'
            name: column_1
            nullable: true
            type: text
          - default: 'NULL'
            name: column_2
            nullable: true
            type: integer
          create_sql: CREATE TABLE materialize.public.table_1 (column_4 pg_catalog.text,
            column_2 pg_catalog.int4);
          id: u1
          rows: 1
      types: {}
      views:
        view_1:
          columns:
          - default: null
            name: column_3
            nullable: true
            type: bigint
          create_sql: "CREATE VIEW\n    materialize.public.view_1\n    AS SELECT pg_catalog.sum(column_2)
            AS column_3 FROM materialize.public.table_1;"
mz_workload_version: 1.0.0
queries:
- began_at: 2026-02-06 12:08:50.038000+00:00
  cluster: cluster_0
  database: materialize
  duration: 0.013
  finished_status: success
  params: []
  result_size: 37
  search_path:
  - public
  sql: SELECT *, 'literal_3' FROM mv_1
  statement_type: select
  transaction_isolation: strict serializable
```

As you can see, the user-specified identifiers as well as literals have been replaced with non-descriptive ones like table_1, mv_1, literal_1, and so on.
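Conceptually, the best-effort replacement amounts to something like this simplified sketch (regex-based, which is exactly why it can miss edge cases like dollar-quoted strings; the actual tool's identifier mappings come from the captured catalog):

```python
import itertools
import re

literal_counter = itertools.count(1)

def anonymize_sql(sql: str, identifier_map: dict[str, str]) -> str:
    """Replace string literals and known identifiers with placeholders."""
    # Replace single-quoted string literals (naive: no dollar quoting etc.).
    sql = re.sub(r"'(?:[^']|'')*'",
                 lambda m: f"'literal_{next(literal_counter)}'", sql)
    # Replace captured identifiers (e.g. some_mv -> mv_1) as whole words.
    for original, replacement in identifier_map.items():
        sql = re.sub(rf"\b{re.escape(original)}\b", replacement, sql)
    return sql

print(anonymize_sql("SELECT *, 'some literal' FROM some_mv",
                    {"some_mv": "mv_1"}))
# -> SELECT *, 'literal_1' FROM mv_1
```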

Future Work

We have an initial set of workloads that serve as a foundation for internal testing. Expanding the captured workloads would further increase our confidence in Materialize and provide additional assurance to customers by reducing the risk of regressions in their specific use cases.

Today we capture some basic statistics about real data, primarily row counts and total bytes, and we also support collecting average column sizes when needed. Extending the statistics collection would allow us to generate synthetic data whose distributions more closely reflect real-world workloads.

Incorporating real samples, or even full data, would open the door to validating correctness in addition to performance, while also making replayed computations more representative. Achieving this would involve closer integration of the capture tooling into Materialize itself, whereas the current approach relies only on querying Materialize's introspection views.

We currently don't support replaying the pattern of creating a Kafka sink that writes data out to a topic and then reading the same topic back into Materialize through a Kafka source; instead the workload replay tool uses two separate topics.

Replayable workload size is currently bounded by what can be executed on a single machine. Supporting distributed replay against both Materialize Self-managed and Materialize Cloud would significantly broaden the scope of testable workloads, with the main challenge being automated setup of the required external sources.

Finally, evolving the anonymization tool to use a full SQL parser and serializer would make identifier replacement more robust and reliable, since we currently rely on the stored CREATE statements instead of generating them dynamically.

Conclusion

Creating test cases manually can be challenging, especially when trying to reproduce problems occurring in large Materialize instances with many external systems involved. The newly introduced Workload Capture & Replay tooling simplifies this significantly and allows us to find regressions earlier in the process. Get in touch with us if you are a customer and interested in supplying a captured workload for testing! The source code of the Workload Capture & Replay tooling is available in our Materialize GitHub repository.