Durable subscriptions
Subscriptions allow you to stream changing results from Materialize to an external application programatically. Like any connection over the network, subscriptions might get disrupted for both expected and unexpected reasons, in which case it’s useful to have a mechanism to gracefully recover data processing.
To avoid the need for re-processing data that was already sent to your external application following a connection disruption, you can:
-
Adjust the history retention period for the objects that a subscription depends on, and
-
Access past versions of this data at specific points in time to pick up data processing where you left off.
History retention period
You must contact us to enable this feature in your Materialize region.
By default, all user-defined sources, tables, materialized views, and indexes keep track of the most recent version of their underlying data. To gracefully recover from connection disruptions and enable lossless, durable subscriptions, you can configure the objects the subscription depends on to retain history.
To configure the history retention period for an object, use the RETAIN HISTORY
option in its CREATE
statement. This value can also be adjusted at
any time using the object-specific ALTER
statement.
Semantics
Increasing the history retention period
When you increase the history retention period for an object, both the existing historical data and any subsequently produced historical data are retained for the specified time period.
-
For sources, tables and materialized views: increasing the history retention period will not restore older historical data that was already outside the previous history retention period before the change.
-
For indexes: if all of the underlying source, table, and materialized view historical data is available for the increased history retention period, the index can use that data to backfill as far back as the underlying historical data is available.
See also Considerations.
Decreasing the history retention period
When you decrease the history retention period for an object:
-
Newly produced historical data is retained for the new, shorter history retention period.
-
Historical data outside the new, shorter history retention period is no longer retained. If you subsequently increase the history retention period again, the older historical data may already be unavailable.
See also Considerations.
Set history retention period
To set the history retention period for sources, tables, materialized views, and indexes, you can either:
-
Specify the
RETAIN HISTORY
option in theCREATE
statement. TheRETAIN HISTORY
option accepts positive interval values (e.g.,'1hr'
). For example:CREATE MATERIALIZED VIEW winning_bids WITH (RETAIN HISTORY FOR '1hr') AS SELECT auction_id, bid_id, item, amount FROM highest_bid_per_auction WHERE end_time < mz_now();
-
Specify the
RETAIN HISTORY
option in theALTER
statement. TheRETAIN HISTORY
option accepts positive interval values (e.g.,'1hr'
). For example:ALTER MATERIALIZED VIEW winning_bids SET (RETAIN HISTORY FOR '2hr');
View history retention period for an object
To see what history retention period has been configured for an object, look up
the object in the
mz_internal.mz_history_retention_strategies
catalog table. For example:
SELECT
d.name AS database_name,
s.name AS schema_name,
mv.name,
hrs.strategy,
hrs.value
FROM
mz_catalog.mz_materialized_views AS mv
LEFT JOIN mz_schemas AS s ON mv.schema_id = s.id
LEFT JOIN mz_databases AS d ON s.database_id = d.id
LEFT JOIN mz_internal.mz_history_retention_strategies AS hrs ON mv.id = hrs.id
WHERE mv.name = 'winning_bids';
If set, the returning result includes the value (in milliseconds) of the history retention period:
database_name | schema_name | name | strategy | value
---------------+-------------+--------------+----------+---------
materialize | public | winning_bids | FOR | 7200000
Unset/reset history retention period
To disable history retention, reset the history retention period; i.e., specify
the RESET (RETAIN HISTORY)
option in the ALTER
statement. For example:
ALTER MATERIALIZED VIEW winning_bids RESET (RETAIN HISTORY);
Considerations
Resource utilization
It’s important to note that increasing the history retention period for an object will lead to increased resource utilization in Materialize.
-
For sources, tables and materialized views: increasing the history retention period for these objects increases the amount of historical data that is retained in the storage layer. You can expect storage resource utilization to increase, which may incur additional costs.
-
For indexes: increasing the history retention period for an index increases the amount of historical data that is retained in memory in the cluster maintaining the index. You can expect memory resource utilization for that cluster to go up, which might require you to size up your cluster and consume additional compute credits.
Best practices
-
Given the additional memory costs associated with increasing the history retention period on indexes, if you don’t need the performance benefits of an index, consider creating a materialized view for your subscription query and configuring the history retention period on that materialized view instead.
-
Similarly, because of the increased storage costs and processing time for the additional historical data on whichever object has an increased history retention period, consider configuring history retention period on the index or materialized view directly powering the subscription, rather than all the way through the dependency chain from the source to the index or materialized view.
Clean-up
The history retention period represents the minimum amount of historical data guaranteed to be retained by Materialize. History clean-up is processed in the background, so older history may be accessible for the period of time between when it falls outside the retention period and when it is cleaned up.
Enabling durable subscriptions in your application
-
In Materialize, configure the history retention period for the object(s) queried in the
SUBSCRIBE
. Choose a duration you expect will allow you to recover in case of connection drops. One hour (1h
) is a good place to start, though you should be mindful of the impact of increasing an object’s history retention period. -
In order to restart your application without losing or re-snapshotting data after a connection drop, you need to store the latest timestamp processed for the subscription (either in Materialize, or elsewhere in your application state). This will allow you to resume using the retained history upstream.
-
The first time you start the subscription, run the following continuous query against Materialize in your application code:
SUBSCRIBE (<your query>) WITH (PROGRESS, SNAPSHOT true);
If you do not need a full snapshot to bootstrap your application, change this to
SNAPSHOT false
. -
As results come in continuously, buffer the latest results in memory until you receive a progress message. At that point, the data up until the progress message is complete, so you can:
- Process all the buffered data in your application.
- Persist the
mz_timestamp
of the progress message.
-
To resume the subscription in subsequent restarts, use the following continuous query against Materialize in your application code:
SUBSCRIBE (<your query>) WITH (PROGRESS, SNAPSHOT false) AS OF <last_progress_mz_timestamp>;
In a similar way, as results come in continuously, buffer the latest results in memory until you receive a progress message. At that point, the data up until the progress message is complete, so you can:
- Process all the buffered data in your application.
- Persist the
mz_timestamp
of the progress message.
You can tweak the flush interval at which you durably record the latest progress timestamp, if every progress message is too frequent.
Note about idempotency
The guidance above recommends you buffer data in memory until receiving a
progress message, and then persist the data and progress message mz_timestamp
at the same time. This is to ensure data is processed exactly once.
In the case that your application crashes and you need to resume your subscription using the
persisted progress message mz_timestamp
:
- If you were processing data in your application before persisting the subsequent progress message’s
mz_timestamp
: you may end up processing duplicate data. - If you were persisting the progress message’s
mz_timestamp
before processing all the buffered data from before that progress message: you may end up dropping some data.
As a result, to guarantee that the data processing occurs only once after your
application crashes, you must write the progress message mz_timestamp
and all
buffered data together in a single transaction.