Disaster recovery

The following outlines various disaster recovery (DR) strategies for Materialize.

Level 1: Basic configuration (Intra-Region Recovery)

Because Materialize is deterministic and its infrastructure runs on a container scheduler (AWS EKS), the basic Materialize configuration provides intra-region disaster recovery as long as:

  • Materialize can spin up a new pod somewhere in the region, and

  • S3 is available.

In such cases, your mean time to recovery is the same as your compute cluster’s rehydration time.

💡 Recommendation
When running with the basic configuration, we recommend that you track your rehydration time to ensure that it stays within an acceptable range for your business's risk tolerance.
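
As a minimal sketch of what tracking rehydration time against a tolerance could look like, the following Python snippet compares a set of recorded rehydration durations with a recovery time objective (RTO). The function name, the sample values, and the idea of keeping the measurements in a list are assumptions made for illustration; how you collect the durations depends on your own monitoring setup.

```python
def check_rehydration(samples_seconds, rto_seconds):
    """Compare observed rehydration times against a recovery time objective.

    samples_seconds: rehydration durations you have recorded (hypothetical data).
    rto_seconds: the longest recovery time your business tolerates.
    """
    if not samples_seconds:
        raise ValueError("at least one rehydration sample is required")
    worst = max(samples_seconds)
    return {
        "worst_seconds": worst,
        # In the basic configuration, recovery time tracks rehydration time,
        # so the worst observed rehydration is the figure to compare.
        "within_tolerance": worst <= rto_seconds,
    }


# Example with made-up measurements and a 10-minute tolerance.
report = check_rehydration([120, 300, 240], rto_seconds=600)
print(report)
```

Comparing the worst observed value rather than the average is a deliberately conservative choice here; a percentile may suit you better if you have many samples.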

Level 2: Multi-replica clusters (High availability across AZs)

NOTE: The hybrid strategy is available if your deployment uses a three-tier or a two-tier architecture.

Materialize supports multi-replica clusters, allowing for distribution across Availability Zones (AZs):

  • For clusters sized up to and including 3200cc, Materialize guarantees that all provisioned replicas in a cluster are distributed across the underlying cloud provider’s availability zones.

  • For clusters sized above 3200cc, even distribution of replicas across availability zones cannot be guaranteed.
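
The size threshold above can be expressed as a small check. This is only an illustrative sketch: the helper name is hypothetical, and it assumes sizes are written in the `<number>cc` form used in this section.

```python
AZ_GUARANTEE_MAX_CC = 3200  # threshold stated in this section


def az_distribution_guaranteed(size: str) -> bool:
    """Return True if replicas of a cluster of this size are guaranteed
    to be distributed across availability zones.

    Assumes a size string such as "400cc"; other size formats are rejected.
    """
    if not size.endswith("cc") or not size[:-2].isdigit():
        raise ValueError(f"unsupported size format: {size!r}")
    # "Up to and including 3200cc" means the comparison is inclusive.
    return int(size[:-2]) <= AZ_GUARANTEE_MAX_CC


for size in ("400cc", "3200cc", "6400cc"):
    print(size, az_distribution_guaranteed(size))
```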

Multi-replica compute clusters and multi-replica serving clusters (excluding sink clusters) whose replicas are distributed across AZs provide DR resilience against machine-level failures, rack- and building-level outages, and AZ-level failures for those clusters:

  • With multi-replica compute clusters, each replica performs the same work.

  • With multi-replica serving clusters (excluding sink clusters), each replica processes the same queries.

As such, your compute and serving clusters will continue to serve up-to-date data uninterrupted in the case of a replica failure.

💡 Cost and work capacity
  • Each replica incurs cost, calculated as cluster size * replication factor per second. See Usage & billing for more details.

  • Increasing the replication factor does not increase the cluster’s work capacity. Replicas are exact copies of one another: each replica must do exactly the same work as all the other replicas of the cluster (i.e., maintain the same dataflows and process the same queries). To increase the capacity of a cluster, you must increase its size.
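
To make the cost relationship concrete, here is a small worked sketch in Python. The per-second rate used below is a made-up number, not a real price, and the function name is hypothetical; refer to Usage & billing for actual rates.

```python
def cluster_credits(rate_per_second, replication_factor, seconds):
    """Credits consumed by a cluster over a period.

    rate_per_second: credit rate for the cluster's size (hypothetical value).
    replication_factor: number of replicas; every replica is billed.
    seconds: how long the cluster runs.
    """
    if replication_factor < 0 or seconds < 0:
        raise ValueError("replication_factor and seconds must be non-negative")
    return rate_per_second * replication_factor * seconds


# Same size, same duration: going from 2 to 3 replicas raises cost by 50%,
# while work capacity stays the same because replicas duplicate the work.
two_replicas = cluster_credits(0.5, replication_factor=2, seconds=10)
three_replicas = cluster_credits(0.5, replication_factor=3, seconds=10)
print(two_replicas, three_replicas)
```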

If you require resilience beyond a single region, consider the Level 3 strategy.

Level 3: A duplicate Materialize environment (Inter-region resilience)

NOTE: The duplicate environment strategy assumes the use of Infrastructure-as-Code (IaC) practice for managing the environment. This ensures that catalog data, including your RBAC setup, is identical in the second environment.

For region-level fault tolerance, you can choose to have a second Materialize environment in another region. With this strategy:

  • You avoid complicated cross-regional communication.

  • You avoid state dependency checks and verifications.

  • And, because Materialize is deterministic, as long as your upstream sources can also be accessed from the second region, the two Materialize environments can guarantee the same results.

💡 No strict transactional consistency between environments
This approach does not offer strict transactional consistency across regions. However, as long as both regions are caught up, the results should be within about a second of each other.

The duplicate Materialize environment setup can be adapted into a more cost-effective setup if your deployment uses a three-tier or a two-tier architecture. For details, see the hybrid variation.

Hybrid variation

NOTE:
  • The hybrid strategy is available if your deployment uses a three-tier or a two-tier architecture.

  • The duplicate environment strategy assumes the use of Infrastructure-as-Code (IaC) practice for managing the environment. This ensures that catalog data, including your RBAC setup, is identical in the second environment.

For a more cost-effective variation of the duplicate Materialize environment in another region, you can choose a hybrid strategy where:

  • Only the source clusters are running in the second Materialize environment.

  • The compute clusters are provisioned only in the event of an incident.

When combined with a multi-replica approach, you have:

  • Immediate failover during an AZ failure.

  • Downtime equal to hydration time during a regional failover.
