Why is this transaction different from all the other transactions?
ACID transactions (Atomicity, Consistency, Isolation, Durability) are a familiar concept in the world of traditional databases, but how do they work with streaming-first data warehouses like Materialize? In our world, there are two types of transactions to consider.
- (1) Contained within Materialize: The first type is transactions that originate in the database itself, for example someone connects to Materialize and types
BEGIN; INSERT INTO ...; COMMIT;.
However, most of our customers don’t ingest data this way. They stream data in from an upstream source, such as Kafka or PostgreSQL.
- (2) Initiated outside Materialize: The second type is transactions that originate in the upstream source, for example someone connects to PostgreSQL and types
BEGIN; INSERT INTO ...; COMMIT;and then Materialize ingests this transaction. If the upstream source has transactional semantics, then Materialize will consume metadata about upstream transactions and respect the transactional boundaries.
Materialize doesn’t support transactions that mix statements run against both the upstream source and Materialize, though this concept does exist and is called a Federated Database.
This post explains the four properties of ACID, how they apply to regular databases, and how they apply to a streaming-first database like Materialize in a format that you may recognize as the four questions.
A - Atomicity
Why is it that all other transactions either apply all the containing statements or none of the statements, but this transaction only applied some of the statements?
This transaction violates atomicity. Atomicity states that all statements in a transaction should be treated as a single logical unit. Either all statements should be applied or none of them should be applied.
Materialize uses a concept called Virtual Time to assign a single timestamp to all statements within a transaction. The storage layer ensures that all transactions that originate in Materialize are atomically applied at that timestamp.
Materialize will wait to receive all statements from a transaction in an upstream source, and then apply all the statements together at a single timestamp. Note, this does not apply to Kafka atomic writes.
C - Consistency
Why is it that all other transactions leave the database in a consistent state, but this transaction leaves the database in an inconsistent state?
This transactions violates consistency. Consistency states that if the database started in a consistent state and the transaction is consistent, then the database stays in a consistent state when the transaction commits. Consistent here means that the database adheres to all database and application level constraints. For example, all foreign keys correspond to an existing primary key.
This should not be confused with consistency from the CAP theorem, which provides guarantees about the order of events in a distributed system, and is probably the more well known and more important consistency property.
This is also probably the most confusing property in ACID to understand, because it’s not wholly in the database’s control to guarantee. For example, if there’s an application constraint that some integer column should always be positive, and an application developer makes a typo and sets a value in that column to negative, then they have violated consistency. It is rumored that the C in ACID was only added to make the acronym sound better, forever leading to the confusion of application developers.
If a transaction that originates in Materialize or an upstream source is consistent, and Materialize was in a consistent state before the transaction, then Materialize will be in a consistent state after the transaction.
I - Isolation
Why is it that all other transactions don’t see the effects of an in-progress transaction, but this transaction sees writes from an uncommitted transaction?
This transaction violates isolation. Isolation states that the execution of a transaction should be isolated from the execution of all concurrently running transactions. One way to think about this is, that your transaction can pretend that is has exclusive access to the database for the duration of the transaction. There are actually multiple isolation levels which usually trade off performance for some isolation guarantees. They allow users to increase the performance of transactions at the cost of violating some aspect of isolation.
Materialize supports two isolation levels. The first one is Serializable, which is the strongest isolation level in the SQL standard and prevents any form of transaction interleaving within Materialize. The second level is Strict Serializable, which isn’t in the SQL standard and ensures that all transactions are linearizable. The configured isolation level applies to all transactions that originate in Materialize.
The serializable guarantee extends to transactions ingested from upstream PostgreSQL and Debezium sources, as long as the upstream source itself is serializable, with one known limitation. So there is no interleaving of transactions that originate in Materialize and that are ingested from an upstream source. A more mathematical way to look at this is that there is some total order of all transactions that originate in Materialize and that are ingested from an upstream source. The strict serializable guarantee does not apply to transactions that originate in an upstream source. That is, the transactions from upstream sources are not linearizable with transactions within Materialize. We are working on a feature called “real-time recency”, which would guarantee that reads in Materialize reflect the most up to date data in an upstream source, as of the start of that read. You can follow progress on real time recency here.
You can read more about Materialize isolation levels in our docs.
D - Durability
Why is it that the effects of all other transactions stay committed even after the database crashes, but the effects of this transaction disappeared after a crash?
This transaction violates durability. Durability states that after a database commits, the effects of that transaction must be persisted even after a restart or crash. Essentially the effects of a transaction need to be written down somewhere before it is committed.
For transactions that originate in Materialize, we will ensure that all transaction data is securely stored in S3 before the transaction has committed.
For transactions that originate in an upstream source, it is up to that upstream source to ensure that the transaction is durable before committing, until Materialize ingests the transaction. As long as the transaction is durable in the upstream source, Materialize can act as an asynchronous read replica and consume the data at any point after the transaction. Once the data is ingested, Materialize will further store any relevant data needed for Sources and Materialized Views in S3, in case the upstream source has a retention period that deletes the data after a specified time period.
Why go through the trouble of following ACID principles in a streaming DB? Materialize aims to take a fairly novel capability (efficient incremental view maintenance on arbitrarily complex SQL, powered by a stream processor) and make it as predictable and unsurprising as possible. A big part of that is designing a system that adheres as closely as possible to the same fundamental properties that establish predictability in traditional databases.