Troubleshooting

As you wire up data ingestion in Materialize, you might run into some snags or unexpected scenarios. This guide collects common questions around data ingestion to help you troubleshoot your sources. If you’re looking for troubleshooting guidance for slow or unresponsive queries, check out the Transform data troubleshooting guide instead.

💡 Tip: For help getting started with your own data, you can schedule a free guided trial.

Why isn’t my source ingesting data?

First, check the status of your source in the Materialize console by navigating to https://console.materialize.com/, clicking the Sources tab in the navigation bar, and clicking the affected source.

Alternatively, you can get this information from the system catalog by querying the mz_source_statuses table:

SELECT * FROM mz_internal.mz_source_statuses
WHERE name = <SOURCE_NAME>;

Status	Description/recommendation
`paused`	Source is running on a cluster with 0 replicas. To resolve this, increase the replication factor of the cluster.
`stalled`	You likely have a configuration issue. The returned `error` field will provide more details.
`failed`	You likely have a configuration issue. The returned `error` field will provide more details.
`starting`	If this status persists for more than a few minutes, reach out to our team for support.
`running`	If your source is in a `running` state but you are not receiving data when you query the source, the source may still be ingesting its initial snapshot. See Has my source ingested its initial snapshot?.

Has my source ingested its initial snapshot?

While a source is snapshotting, the source (and the associated subsources) cannot serve queries. That is, queries issued to the snapshotting source (and its subsources) will return after the snapshotting completes (unless the user breaks out of the query).

Snapshotting can take anywhere from a few minutes to several hours, depending on the size of your dataset, the upstream database, the number of tables (more tables can be parallelized in Postgres), and the size of your ingestion cluster.

We’ve observed the following approximate snapshot rates from PostgreSQL:

Cluster Size	Snapshot Rate
25 cc	~20 MB/s
100 cc	~50 MB/s
800 cc	~200 MB/s

To determine whether your source has completed ingesting the initial snapshot, you can query the mz_source_statistics system catalog table:

SELECT snapshot_committed
FROM mz_internal.mz_source_statistics
WHERE id = <SOURCE_ID>;

You generally want to aggregate the snapshot_committed field across all worker threads, as done in the above query. The snapshot is only considered committed for the source as a whole once all worker threads have committed their components of the snapshot.

Even if your source has not yet committed its initial snapshot, you can still monitor its progress. See How do I monitor source ingestion progress?.

How do I speed up the snapshotting process?

We’ve observed the following approximate snapshot rates from PostgreSQL:

Cluster Size	Snapshot Rate
25 cc	~20 MB/s
100 cc	~50 MB/s
800 cc	~200 MB/s

To speed up the snapshotting process, you can scale up the size of the cluster used for snapshotting, then scale it back down once the snapshot completes.

ALTER CLUSTER <cluster_name> SET ( SIZE = <new_size> );

NOTE: Resizing a cluster that hosts sources requires the cluster to restart. This operation incurs downtime for the duration it takes for all objects in the cluster to hydrate.

Once the initial snapshot has completed, you can resize the cluster for steady state.

For upsert sources, a larger cluster can not only speed up snapshotting, but may also be necessary to support increased memory usage during the process. For more information, see Use a larger cluster for upsert source snapshotting.

How do I monitor source ingestion progress?

Repeatedly query the mz_source_statistics table and look for ingestion statistics that advance over time:

SELECT
    bytes_received,
    messages_received,
    updates_staged,
    updates_committed
FROM mz_internal.mz_source_statistics
WHERE id = <SOURCE_ID>;

You can also look at statistics for individual worker threads to evaluate whether ingestion progress is skewed, but it’s generally simplest to start by looking at the aggregate statistics for the whole source.

The bytes_received and messages_received statistics should roughly match the external system’s measure of progress. For example, the bytes_received and messages_received fields for a Kafka source should roughly match the upstream Kafka broker reports as the number of bytes (including the key) and number of messages transmitted, respectively.

During the initial snapshot, updates_committed will remain at zero until all messages in the snapshot have been staged. Only then will updates_committed advance. This is expected, and not a cause for concern.

After the initial snapshot, there should be relatively little skew between updates_staged and updates_committed. A large gap is usually an indication that the source has fallen behind, and that you likely need to scale it up.

messages_received does not necessarily correspond with updates_staged and updates_commmited. For example, a source with ENVELOPE UPSERT can have more updates than messages, because messages can cause both deletions and insertions (i.e. when they update a value for a key), which are both counted in the updates_* statistics. There can also be fewer updates than messages, as many messages for a single key can be consolidated if they occur within a (small) internally configured window. That said, messages_received making steady progress while updates_staged/updates_committed doesn’t is also evidence that a source has fallen behind, and may need to be scaled up.

Beware that these statistics periodically reset to zero, as internal components of the system restart. This is expected behavior. As a result, you should restrict your attention to how these statistics evolve over time, and not their absolute values at any moment in time.