The following information is not specific to Materialize (an operational data warehouse often used as a CDC destination.) But, if you’re building something with CDC, Materialize may be useful as a single destination handling multiple use cases. Read why here or register for access to try it out for free.
Who is this for?
This guide assumes a basic understanding of data engineering concepts around databases and pipelines, as well as:
- ✅ Familiarity with Change Data Capture (CDC)
- ✅ Certainty that CDC is the right tool for your specific use-case
It’s geared towards anyone preparing to implement CDC in production in a business.
What to expect
Use this information as a starting point for a production-grade log-based, streaming CDC implementation and operation plan.
We’ve deliberately kept as much as possible generalized to CDC as a design pattern, rather than diving into specifics of one CDC tool. However, the open-source tool Debezium plays such a central role in CDC that we do cite illustrative examples specific to Debezium in later sections.
To avoid confusion in terminology, we use the following terms consistently throughout the rest of this guide:
|Source Database||The database from which you are “capturing” changes.|
|CDC Pipeline||Typically consisting of the CDC Connector that reads the database log and produces change records to a message broker (like Kafka), and a CDC Loader that reads data out to a Destination.|
|Destination||The service (often a database, cache or search index) where the data ends up.|
|CDC Pipeline Operator||Typically a Data Engineer, the person owning implementation of CDC and responsible for ongoing operation, monitoring and maintenance.|
|Source DB Admin||The person or team responsible for the operation of the source database.|
|Application Developers||Anyone who might make changes to the schema of the source database.|
|Destination(s) Operator||Responsible for operation of the destination(s) of CDC pipeline, e.g. the Data Warehouse, often the same team that operates the pipeline.|
|Initial Snapshot||Sometimes called “Bootstrapping”, when CDC is first turned on and the CDC connector must bulk load the current state of the source database into the message broker.|
|Streaming||The ongoing operational state of the CDC pipeline.|
|Re-Snapshot||Or ad-hoc snapshot, or re-bootstrapping, times within Streaming phase where CDC Connector must reload the entire state of some or all of the source database.|
Preparing the Source Database
We’ll start with everything specific to the source database, this is most relevant to the CDC Pipeline Operator and Source DB Admin.
Connecting to Primary or Replica?
Consider your options and trade-offs when deciding whether to connect CDC to the primary or a replica of your source database:
Connecting to primary is not uncommon, the process is similar to creating a new read-replica, and primary is sometimes the only option: (some managed databases like AWS Aurora Postgres don’t give replicas the configurability necessary to enable CDC.)
Connecting to a replica decreases the load on primary, decreases the risk of CDC changes affecting Source DB availability, and maintains CDC if another replica is promoted to primary. But it comes at the cost of higher latency: Updates must propagate through primary to replica before being seen by CDC.
One Connector or Many?
Start your CDC implementation with one connector.
There are typically two ways to expand to multiple connectors, with varying trade-offs for each:
1. Multiple connectors by destination use case
Let’s say you’re using CDC to power an application cache and user-facing analytics. Imagine something catastrophic happens to the user-facing analytics service, and you need to re-snapshot everything. With two services reading the same feed of updates, the application cache will have to handle the deluge of updates from the new snapshot that the user-facing analytics service needed. Separate connectors for each use-case removes risk of re-snapshot activity in one affecting the other, and it gives you greater control to only export the exact data needed for each use-case.
Separate connectors for each use case gives you better control and isolation, but it comes at the cost of higher operational burden, increased load on source database, and increased infra cost, especially network.
2. Multiple connectors by source data
Another approach to scaling connections is separation by source database table:
This gives you the ability to isolate the failure domain by source data: If the schema of one table is changing often, it can be separated out into its own connector, and now breaking changes only stop the feed of a specific table. However, a major negative of this approach is that it can prevent the destination use case from properly handling database transactions.
CDC User/Role Creation
Route all CDC load through a new database role/user with an obvious name (e.g.
cdc_user ) to help admins understand and identify the workload in the future. Make sure the user/role has the appropriate access privileges, and be intentional about whether new tables (created in the future) are automatically accessible.
Database Configuration Changes
CDC typically requires increased logging levels or activation of db extensions, some of these changes require a database restart. With ample advance notice, the configuration change can be made during a planned maintenance window. Validate that the necessary configuration changes won’t break existing services.
Important Decision: When it comes to log retention, many databases give you the ability to choose:
- Let the CDC pipeline fail first - You can set a maximum log retention time or size that is well within your overall storage allocation so that if a CDC service can’t keep up with the pace of changes or stops completely, the database will delete old logs before letting them fill up all your storage space.
- Let the database fail first - If you don’t set a max log retention size, you are in effect deciding to let the database run out of storage space before letting it delete unread logs.
Securing Source DB and CDC Connector
Your Source Database should never be directly exposed to the internet. Choose the right means of connecting (from most to least secure):
- CDC in same private network - Typically only possible when running self-managed or on-prem.
- CDC connects via proprietary network tunnel - Configure the connection via AWS PrivateLink or GCP/Azure equivalent.
- CDC connects via Bastion Host - Run a bastion host server on the same private network as your database and make it (securely) open to SSH from your CDC Connector.
- CDC connector’s static IP added to Source Database firewall allowlist - keep in mind that this approach reduces security for others with access to the same IP.
Preparing Source DB for Initial Snapshot
CDC needs to load a snapshot of the current state before listening for changes. This process typically has a disproportionately high load on the source database.
Initial snapshots can take multiple days on a large DB. Every database will see CPU utilization immediately increase during snapshot, and storage will increase at a rate proportionate to database update activity because a log of all subsequent transactions must be kept around through the snapshot load. Some databases may even lock tables during the snapshot, resulting in major impacts to availability.
Debezium-Aurora MySQL Specific Warning: the initial snapshot phase will take a write-lock on tables you are replicating, preventing other connections from writing to those tables.
Perform a Dry Run
Use an offline replica of your production database to test the initial snapshot phase. Here is a checklist to go through during the test:
- How long does the initial snapshot take? - A snapshot of live production db with same resource allocation will take longer due to ongoing load.
- How long does it take to process 1hr of database activity? - useful for future outages, if 1hr of activity takes 15 minutes to process, CDC will take ~80 minutes to recover from a 4hr outage.
- What does resource utilization look like? - Watch especially for network utilization, check that bandwidth is below the level at which your host will throttle.
- Are tables writeable/readable during snapshot? - Open a separate connection and check that reading/writing continue to work as expected.
If testing shows that the default approach to the initial snapshot is not feasible, there are possible workarounds, but keep in mind you may have to re-snapshot tables and redo your workarounds in the future!
- Debezium has a feature called Incremental Snapshotting that can be used to reduce the load impact of the snapshot phase, even when first starting the connector. Start Debezium with
snapshot.mode = schema_onlyand then manually trigger an incremental snapshot.
- Start CDC connected to an offline replica of your Source Database, once the initial snapshot phase completes, manually update the CDC configuration to read from primary.
Expected load on Source DB during Streaming
CDC should have a comparable effect to setting up a read-replica in terms of impact to resource utilization on the Source Database. If replication is already active, the incremental impact of CDC should be minimal.
- CPU: Increasing the level of logging necessary for CDC may result in an increase in overall CPU utilization.
- Memory: No major impact.
- Storage: Storage increases due to more information and longer retention in logs.
- Network: During snapshot/re-snapshot, and during surges of change activity, CDC can have a compounding effect on network throughput.
Monitoring the Source Database
There are several database metrics that may not have been monitored before, but are important to monitor when CDC is active.
- Replication Log Size - Monitor the size of the change log kept around by the database, (MySQL: binlog, Postgres: WAL). An ever-increasing log size means either: (1) CDC cannot keep up with the pace of changes or (2) CDC has stopped consuming changes. Either scenario is one that needs to be addressed with urgency.
- Replication Status - You can typically separately monitor the status (active/inactive) of replication to alert when CDC has stopped consuming changes.
- Network Utilization - This should be already monitored, but the key with the addition of CDC is to watch for CDC-caused spikes in network that could lead to throttled connections.
- Disk Utilization - This should already be monitored, but it’s important to note that CDC introduces a new way that disk utilization can increase: change logs.
Runbooks for the Source Database
Database administrators typically use runbooks or workflows to follow a known and predictable process when things go wrong. There are a few ways that CDC can change these processes.
- Failover/Standby Promotion - If CDC is reading from the primary, you’ll need to add steps to reactivate streaming after promoting a replica to primary.
- Disk Full - Make a note to check the logical logs and replication statuses.
- Network Throttled - Add checks for contribution to network bandwidth from CDC
- Downtime - Coming back from downtime will require added steps to resume streaming. Coordinate how and when the CDC connector is resumed to reduce the thundering herd effect.
Operating the CDC Pipeline
Once the initial configuration and snapshot phases are done, all CDC production pipeline operators we spoke to relayed that the streaming phase of CDC is relatively low-maintenance. Most of the ongoing work in operating the pipeline comes from changes on either side, making this section fairly short.
Database usage patterns to watch out for
During the streaming phase, one of the ways a pipeline can fail is via an unexpected source database usage pattern. Here are two to watch out for:
- Massive transactions - Depending on the CDC tool and how the data is being used in the destination, a single massive transaction can force the typically-streaming tool to “batch” an update, possibly spiking memory requirements or increasing latency.
- Deleting lots of records at once - CDC emits an event for every delete, scheduled batch archive and delete jobs can inadvertently cause massive spikes in CDC throughput.
Monitoring the CDC Pipeline
The following metrics and liveness checks are crucial to operating a CDC pipeline in production:
CDC Connector - Two things to watch out for:
- Monitoring rate of new messages not sufficient: Depending on the level of activity in your database, no messages produced might just mean no changes from database.
- Debezium-specific: Just monitoring the health of Kafka Connect is not enough. Depending on the issue, the individual database-specific connector can be down with no impact to the Kafka Connect service.
CDC Loader - Monitor that updates are actively read downstream, this can often be done both in the pipeline and in the destination use-case. For example, in Kafka,
kafka_consumergroup_lagis a particularly useful metric to monitor as the movement of this metric indicates both producer and consumer activity. More details here
Runbooks for a CDC Pipeline
Prepare for operation of a production pipeline with runbooks that detail out workflows for the following scenarios:
- CDC Connector Down - Document the steps to take when updates are no longer being read into your pipeline. Time-to-response is important here because longer outages mean (1) More “catch-up” work required upon resolution and (2) Beyond a certain cut-off, outages may require complete re-snapshot process.
- Destination(s) not consuming updates - i.e. changes are piling up in your pipeline, this process can be designed to be less time-sensitive by allowing for increased retention windows in the pipeline.
- Re-Snapshot necessary - Whether caused by issues at the destination or changes to data required, the runbook for a “re-snapshot” should look like an abbreviated version of the initial snapshot process. Keep in mind that if multiple destinations are reading from the same pipeline, all destinations will see the influx of updates caused by the re-snapshot.
Handling Schema Evolution
Adding CDC functionality is essentially creating a new API for your database. For a CDC implementation to succeed, changes to database schema need to be handled with the same level of care and proactive communication as changes to any other internal API. There is typically a balancing act between: 1. Empowering your application developers to move fast, and 2. Ensuring schema evolution is handled proactively and smoothly downstream.
Tailor the level of procedures and controls to the size and maturity of your organization.
Important Decision: You typically have a choice:
- Allow breaking changes to schema, risk Destination fail - The CDC Connector will log that a change has been made, but seamlessly switch to streaming the new format of messages and assume your destination will handle it.
- Block breaking changes to schema, let CDC Connector fail - The CDC Connector will be blocked from logging messages with an incompatible format, stopping streaming until there is a manual intervention to resolve.
Risky schema evolution patterns for CDC
The more standardized and predictable the schema evolution process on your source database, the better. Here is a list of specific patterns of schema evolution that could cause problems when run on databases with CDC active downstream:
- Atomic renaming - (hot-swapping tables by changing two names in one transaction)
- Deleting and recreating tables - CDC is typically following changes from tables by internal ID, so a new table with a matching name won’t get picked up.
- TRUNCATE - Truncates are generally avoided, but especially bad with CDC active because they won’t register deletes on the rows that are removed.
Automation Tip: add automation or checks somewhere in the schema change lifecycle to reduce CDC breakage, here are some examples:
- Require PR approvals from CDC Operator on schema changes - [Proactive] This risks slowing down development pace, but allows CDC operator to proactively review changes before they have the potential to break CDC.
- Write automation that alerts CDC Operator of schema changes - [Reactive] Use your existing CI/CD pipeline to alert CDC Operator once a schema change has been made
- Use database triggers to alert on schema changes - [Reactive] If you can’t guarantee that all schema changes go through PR review process, you can use database triggers to execute only on certain DDL statements.
Change data capture as a design pattern is seeing widespread adoption as a means of replicating primary database state to different tools and use cases without overloading the database and without compromising latency. One crucial step for wider CDC adoption is a well-understood and standardized set of practices for operating it in production. This guide gives teams a starting point for implementing CDC in production. If you have feedback or suggestions, let us know in the Materialize community Slack - we treat this as a living document and will add and update the guide as CDC practices evolve.