Materialize DR characteristics

The following sections describe how various failure modes affect Materialize and the recovery characteristics for each.

System Environment

Materialize system components provide core features such as database consensus, durable storage, network access and security, and compute provisioning. These components are deployed in highly available configurations or with automated recovery mechanisms. However, they can still experience outages due to failures in underlying providers.

Failure Type	Impact
Single Availability Zone (AZ)	Connection issues when using single-AZ PrivateLink and with sources/sinks. Brief `pgwire` and `https` connection drops as the network rebalances.
Two Availability Zones	Temporary issues with cluster provisioning. Temporary issues with Console access.
Three or More Availability Zones	Partial to no access to the database. May require point-in-time recovery (PITR) of environments.
Single Region System Resources	Some metadata resources run in a highly available (HA) configuration in us-east-1. An outage in us-east-1 may cause issues viewing the Console for other regions. This does not affect database access, uptime, or performance.

Recommendation(s)

Use PrivateLink when possible, and configure it to use multiple AZs.
If you are concerned about multi-AZ outages, consider running a duplicate Materialize environment in a second region.

Database environment

`environmentd`

`environmentd` runs on a single node in a single AZ. Because `environmentd` stores no data, RPO is not applicable (N/A).

The component has the following failure characteristics:

Failure Type	RPO	RTO (RF1 - single AZ)	RTO (RF2 - multiple AZs)
Machine failure	N/A	Time to launch on new machine (~seconds to minutes).	N/A
Single AZ failure	N/A	Time to launch new instance in a new AZ.	N/A

RPO (Recovery Point Objective) • RTO (Recovery Time Objective) • RF (Replication Factor)

Key point(s)

If environmentd becomes unavailable, RTO is non-zero.
If `environmentd` becomes unavailable, its RTO also affects the RTO of the clusters, because you cannot access data while `environmentd` is unavailable.
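
The dependency between the two RTOs can be sketched in Python. This is a minimal illustration under a stated assumption, not a description of Materialize internals: it assumes that `environmentd` and the cluster recover in parallel, so data becomes reachable only when the slower of the two is back. The function name and the example durations are hypothetical.

```python
def effective_rto_seconds(cluster_rto_s: float, environmentd_rto_s: float) -> float:
    """Time until data is reachable again after a failure hitting both components.

    Assumption (not from the Materialize docs): both recoveries start at the
    same moment and proceed in parallel, so the slower one determines access.
    """
    if cluster_rto_s < 0 or environmentd_rto_s < 0:
        raise ValueError("recovery times must be non-negative")
    return max(cluster_rto_s, environmentd_rto_s)


# Hypothetical numbers: a cluster that never went down (RTO 0) is still
# unreachable for the 90 seconds environmentd takes to relaunch.
print(effective_rto_seconds(0, 90))    # 90
print(effective_rto_seconds(300, 90))  # 300
```

If the recoveries were sequential rather than parallel, the two values would add instead; which model applies depends on the failure.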

Clusters

Failure Type	RPO	RTO (RF1 - single AZ)	RTO (RF2 - multiple AZs)
Machine failure	0	Time to spin up a new machine + possible rehydration time, depending on the objects on the machine: Non-upsert sources: no rehydration required. Upsert sources: rehydration time. Sinks: no rehydration required. Compute: rehydration time. Serving: rehydration time. Additionally, there may be some time to catch up with changes that occurred during the downtime. To reduce rehydration time, scale up the cluster.	Can be: 0 if only compute and serving objects are on the machine; time to spin up a new machine if sources or sinks are on the machine. In addition, cluster RTO is affected if `environmentd` is down (seconds to minutes).
Single AZ failure	0	For managed clusters: time to spin up a new machine + possible rehydration time, depending on the objects on the machine: Non-upsert sources: no rehydration required. Upsert sources: rehydration time. Sinks: no rehydration required. Compute: rehydration time. Serving: rehydration time. Additionally, there may be some time to catch up with changes that occurred during the downtime. To reduce rehydration time, scale up the cluster. During downtime, single-AZ PrivateLink connections are impacted.	Can be: 0 if only compute and serving objects are on the machine; time to spin up a new machine if sources or sinks are on the machine. In addition, cluster RTO is affected if `environmentd` is down (seconds to minutes).
Regional failure (or failure of 2 AZs)	At most, 1 hour (time since last backup, based on hourly backups).	~1 hour (time to check pointers).	High/Significant. Consider using a regional failover strategy.

RPO (Recovery Point Objective) • RTO (Recovery Time Objective) • RF (Replication Factor)

Key point(s)

Cluster RTO can be affected if the environmentd is down (seconds to minutes).
For a regional failover strategy, you can run a duplicate Materialize environment in a second region.
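
The machine-failure and single-AZ rows of the cluster table can be encoded as a small Python sketch. It only restates the table's rules; `spin_up_s` and `rehydration_s` are caller-supplied estimates, all names are hypothetical, and catch-up time and `environmentd` downtime are deliberately left out.

```python
from enum import Enum


class ObjectKind(Enum):
    NON_UPSERT_SOURCE = "non-upsert source"
    UPSERT_SOURCE = "upsert source"
    SINK = "sink"
    COMPUTE = "compute"
    SERVING = "serving"


# Per the table: these kinds rehydrate after their machine is replaced.
NEEDS_REHYDRATION = {ObjectKind.UPSERT_SOURCE, ObjectKind.COMPUTE, ObjectKind.SERVING}
SOURCES_AND_SINKS = {
    ObjectKind.NON_UPSERT_SOURCE,
    ObjectKind.UPSERT_SOURCE,
    ObjectKind.SINK,
}


def cluster_rto_seconds(kinds, replication_factor, spin_up_s, rehydration_s):
    """Rough RTO for a machine or single-AZ failure, following the table."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    kinds = set(kinds)
    if replication_factor >= 2:
        # RF2 column: 0 if only compute/serving objects, else spin-up time.
        return spin_up_s if kinds & SOURCES_AND_SINKS else 0
    # RF1 column: spin-up time, plus rehydration if any object requires it.
    return spin_up_s + (rehydration_s if kinds & NEEDS_REHYDRATION else 0)


# Hypothetical estimates: 60 s to spin up a machine, 600 s to rehydrate.
print(cluster_rto_seconds([ObjectKind.COMPUTE, ObjectKind.SERVING], 2, 60, 600))  # 0
print(cluster_rto_seconds([ObjectKind.UPSERT_SOURCE], 1, 60, 600))                # 660
```

The sketch treats rehydration as a single lump estimate; in practice it varies with the objects on the machine and with cluster size.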

Materialize data corruption/operations error

Failure Type	RPO	RTO (RF1/RF2)
Non-data corruption errors	Maximum 1 hour (time since last backup, based on hourly backups).	Case specific
Data corruption errors	High/Significant. RPO is dictated by upstream system.	Case specific

RPO (Recovery Point Objective) • RTO (Recovery Time Objective) • RF (Replication Factor)
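
The "time since last backup" RPO in the table is simple arithmetic, sketched here in Python. The timestamps are made up for illustration, and the function name is hypothetical; the only fact taken from the table is the hourly backup interval.

```python
from datetime import datetime, timedelta

BACKUP_INTERVAL = timedelta(hours=1)  # hourly backups, per the table


def rpo_at(failure_time: datetime, last_backup_time: datetime) -> timedelta:
    """Data-loss window if recovery restores the most recent backup."""
    if last_backup_time > failure_time:
        raise ValueError("last backup cannot be after the failure")
    return failure_time - last_backup_time


# Hypothetical incident: backup at 12:00, failure at 12:45.
last_backup = datetime(2024, 1, 1, 12, 0)
failure = datetime(2024, 1, 1, 12, 45)
loss = rpo_at(failure, last_backup)
print(loss)                     # 0:45:00
print(loss <= BACKUP_INTERVAL)  # True
```

As long as backups complete on schedule, the result never exceeds `BACKUP_INTERVAL`, which is where the one-hour maximum comes from.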

End-user error

Failure Type	RPO	RTO (RF1/RF2)
Accidental source drop (and dependent objects)	Same as upstream source system. Source will need to be recreated in Materialize. Consider using RBAC to reduce the risk of accidentally dropping sources.	Time to recreate the source and snapshot + time to recreate the dependent objects and rehydrate. Consider using RBAC to reduce the risk of accidentally dropping sources.
Accidental materialized view/index drop	0	Time to rehydrate.

RPO (Recovery Point Objective) • RTO (Recovery Time Objective) • RF (Replication Factor)

Key point(s)

You can use RBAC to reduce the risk of accidentally dropping sources (and other objects) in Materialize.

See also

`environmentd`