“The real difficulty with lying is that you have to keep track of all the lies that you’ve told, and to whom” is a quote I once read that I can’t definitively source (it’s… inconsistently attributed to Mark Twain). It’s stuck with me because it captures why it’s so hard to be productive as a programmer in a world of weak isolation models.

[Author’s note: database communities use the term “isolation,” and distributed systems communities use the term “strong consistency,” to refer to overlapping concepts. In the rest of this post, I will stick to the database terminology because this is all their fault in the first place.]

(Anoma)lies

If you lie to someone, you have to remember everything you’ve told everyone else, and game out who might be talking to whom. Then you have to reason about how you could get caught. This slows your thinking and saps your mental agility. Similarly, if you work with data platforms that do not provide strong isolation, you have to carefully consider how that might lead to error states or end-user-visible inconsistencies. You’re potentially telling “lies,” and you need to keep track of them.

This slows down your development velocity. Most of your time is spent reasoning about architecture diagrams. You might be giving an inconsistent read to an unsuspecting client. You have to keep track of which services are not communicating through the database. I concede that the “lie” metaphor might be provocative, but it’s a good approximation for what an “anomaly” is in practice. And lying is a solid framework for understanding the concept of database consistency.

Some databases with weak isolation are correctly documented, because they promise nearly nothing and deliver on this minimal promise. That’s not a lie (“I didn’t say I was going to check, you read into it…”). But in practice, this is misleading for developers. At the very least, it slows them down. As I’ll show later, even the most sophisticated database programmers get tripped up by the subtleties of weak isolation models.

Fundamentally, programming atop weak isolation demands a significant amount of work from developers. The case for building atop strong isolation is this: it enables local reasoning. The other dubiously sourced Mark Twain quote is “If you tell the truth, you don’t have to remember anything”. Databases with strong isolation are almost like oracles. They tell the truth all the time.

Translated to distributed systems, you can interpret “isolation” here quite literally: it allows the programmer of a single query to reason about that query in isolation. Weak isolation, on the other hand, requires global reasoning, which means that every programmer writing queries against the system must be on the same page at all times.

When you give an inconsistent read, whether you get away with it depends on which reads end up conflicting downstream, in other systems. And in a world where the database is accessed by multiple clients, you always need to reason about how those clients interact further downstream, because any errors will propagate outwards. This means that for any code change, you have to consider the context of all the other queries that might hit the database.

A precedence graph of (anoma)lies

Let’s model this formally. One strong isolation level is serializability, which you can get in two different ways. First, you can use a database that guarantees serializability. Or second, you can take all the queries that could run on the database, construct a directed precedence graph from their conflicting operations, and check this graph for cycles.

This is a nice definition. You can have a set of queries that are conflict-serializable even if they run on a system that provides weaker guarantees. This is because they are cleverly designed not to interfere! A database that provides serializable isolation ensures that no transactions could ever cause a cycle.
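To make that concrete, here’s a minimal sketch of the cycle check in Python. The interleaved history, transaction names, and items are purely illustrative (not from any real system): we draw an edge from Ti to Tj whenever an earlier operation of Ti conflicts with a later operation of Tj, then look for a cycle with a depth-first search.

```python
# A toy conflict-serializability check. The history below is hypothetical.
# Each entry is (transaction, operation, item), in the order the operations ran.
history = [
    ("T1", "read",  "x"),
    ("T2", "write", "x"),
    ("T1", "write", "x"),  # T1 -> T2 (read/write on x) and T2 -> T1 (write/write on x): a cycle
]

def conflict(a, b):
    # Two operations conflict if they come from different transactions,
    # touch the same item, and at least one of them is a write.
    return a[0] != b[0] and a[2] == b[2] and "write" in (a[1], b[1])

def precedence_graph(history):
    # Edge Ti -> Tj: some operation of Ti conflicts with a later operation of Tj.
    edges = {txn: set() for txn, _, _ in history}
    for i, earlier in enumerate(history):
        for later in history[i + 1:]:
            if conflict(earlier, later):
                edges[earlier[0]].add(later[0])
    return edges

def has_cycle(edges):
    # Depth-first search with the usual three-color marking.
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in edges}

    def visit(node):
        color[node] = GRAY
        for nxt in edges[node]:
            if color[nxt] == GRAY or (color[nxt] == WHITE and visit(nxt)):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in edges)

print(has_cycle(precedence_graph(history)))  # True: this history is not serializable
```

If the graph comes back acyclic, the history is conflict-serializable; a database that guarantees serializability is simply promising that it will never let a cyclic history happen in the first place.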

But if you have a database that only provides snapshot isolation, it won’t catch one particular shape of cycle, called write skew: two transactions each read overlapping data, then make disjoint writes based on what they read. You can still ensure that the end result has no anomalies by manually inspecting the set of transactions you run. But this checking process is hard!
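To give a feel for the shape snapshot isolation misses, here’s a small Python sketch of the classic “at least one doctor on call” write skew. The scenario, the names, and the way snapshot semantics are simulated here are all hypothetical, but the mechanics are the standard ones: both transactions read the same snapshot, write disjoint rows, and both commit, even though together they break the invariant.

```python
# Hypothetical write skew: the invariant is "at least one doctor stays on call".
committed = {"alice": "on_call", "bob": "on_call"}

def go_off_call(snapshot, me):
    # Read from the snapshot taken when the transaction started.
    on_call = [doc for doc, status in snapshot.items() if status == "on_call"]
    if len(on_call) >= 2:          # "someone else is still covering"
        return {me: "off_call"}    # write set: only my own row
    return {}

# Both transactions start from the same committed snapshot...
snapshot = dict(committed)
writes_t1 = go_off_call(snapshot, "alice")
writes_t2 = go_off_call(snapshot, "bob")

# ...and both commit. Snapshot isolation's write-write conflict check passes
# because the write sets are disjoint -- there is no overlapping row to flag.
assert not (writes_t1.keys() & writes_t2.keys())
committed.update(writes_t1)
committed.update(writes_t2)

print(committed)  # {'alice': 'off_call', 'bob': 'off_call'}: nobody is on call
```

In the precedence graph, each transaction’s read of the other’s row precedes the other’s write, giving exactly the kind of cycle that snapshot isolation’s checks never look for.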

In practice, few people are actually doing this with great success (let alone using the formal algorithm). But given that Oracle only provides snapshot isolation (unhelpfully called “serializable” for historical reasons), there’s plenty of lore around what to be careful of when looking at the set of transactions. On this topic, consult your local Oracle DBA for more information.

As database guarantees get weaker than snapshot isolation, a wider set of anomalies can potentially occur, which leads to even more hard-to-catch shapes in the precedence graph. That requires a wider set of checks over the complete set of all possible transactions running against the database. If your database is running in read committed mode (the Postgres default), you have to ensure that your transactions don’t fall prey to phantom reads, lost updates, or non-repeatable reads, which is difficult¹.
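As an example of one of those anomalies, here’s a sketch of a lost update: a read-modify-write split across two statements, which read committed on its own will not stop. The account name and amounts are made up; the point is just that both clients read the same committed value before either writes.

```python
# Hypothetical lost update under read committed: two clients both read the
# latest committed balance, compute a new value in application code, and
# write it back. One deposit silently disappears.
balance = {"account_1": 100}  # committed state

def compute_deposit(amount):
    current = balance["account_1"]  # statement 1: read the committed balance
    return current + amount         # new value computed outside the database

# Both clients read before either writes, so both see 100.
new_a = compute_deposit(30)  # client A computes 130
new_b = compute_deposit(50)  # client B computes 150

# Statement 2: each client writes back its computed value.
balance["account_1"] = new_a
balance["account_1"] = new_b

print(balance)  # {'account_1': 150} -- client A's deposit of 30 is lost
```

It’s exactly this kind of pattern that you end up guarding by hand (with explicit locking, retries, or restructured queries) when the database doesn’t do the guarding for you.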

Honesty is often the best policy

Does this line up with all the checks you’re running across your distributed infrastructure? In practice, nobody is doing this to the formal standards of rigor. Nor are they incorporating the checks as part of every change to every database query. But you’re probably reasoning quite a bit about the common transaction paths. You’re drawing out full architecture diagrams and investigating any bugs with distributed traces. You’re looking for inconsistencies and patching them with some fencing around your queries.

My point is this: it is extremely wasteful. The hard truth is that global reasoning is the most expensive thing of all. It involves humans scheduling meetings and staring at the complete set of all possible transactions. Then they must review the transactions proposed by other programmers. And the most expensive part, by far, is the salary hours your employees dump into this process.

That said, weak isolation is not something to categorically exclude. Imagine you’re working on distributed infrastructure at unprecedented scales at one of the largest companies in the world. It might make sense to build bespoke high throughput infrastructure that has to make some careful tradeoffs in exchange for performance.

The FBI and CIA have involved, convoluted protocols to keep their lies straight. But is this an ideal pursuit for a database programmer? There’s an easier way to keep the answers straight: tell the truth. If you do go the other way, you can build a process to ensure that all subsequent changes do not create any inadvertent anomalies. But it’s not something to take on casually: it’s a tool of last resort, for when you’ve really hit the performance limits of strongly isolated systems.

Most developers building data infrastructure are presenting an interface upwards: they are in the business of building a database-like internal service. Once they get down to building their own inverted database out of stream processors, Redis caches, or queues, they’re on the hook for delivering isolation guarantees. At the very least, they must document those guarantees accurately and help their teams use the service correctly.

Enough with the anomalies!

Isolation is particularly difficult in the case of stream processing. Stream processors are typically deployed in situations where the inputs are unbounded and the computation is continuous. Many systems with weak isolation guarantees are designed with the informal goal of eventual consistency (i.e. we’ll get around to the truth… at some point).

But this doesn’t fit well with stream processing: if the inputs are never settled, eventual consistency could very well mean that the outputs are never settled. That’s a large departure from most people’s expectations. Eventual consistency sets a potentially acceptable expectation that deviations are bounded and temporary; that’s very different from deviations being permanent and unbounded.

It’s possible, using stream processors, caches, key-value stores, and custom programs, to build a system that gives clear correctness guarantees to end users, but it’s certainly not trivial. The guarantee to aim for is strict serializability: the isolation guarantee that fits best with people’s natural intuitions about concurrency control, and the one that we deliver at Materialize.

At Materialize, we’ve put in quite a lot of work to build a system that is trustworthy, and we are clear about what that means for you. We’re betting that most of you don’t want to become consistency experts, and certainly don’t want to acquire that expertise during the course of an incident retro. Who wants to keep track of all those lies?

If you’re tired of keeping track of all those lies, sign up for a free trial of Materialize to leverage strong consistency.

Footnotes

1. Sometimes this can get quite subtle: for instance, Postgres supports an intermediate level called repeatable read. While repeatable read theoretically allows phantom reads, Postgres goes one step further (https://www.postgresql.org/docs/current/transaction-iso.html): its implementation disallows phantoms. Since the ANSI standard defines four anomalies, from the table it looks like Postgres’ repeatable read implementation is as good as its serializable implementation, right? And if you do any performance benchmarking, repeatable read is faster than serializable. In practice, serializable is such a large performance hit that few people run Postgres in serializable mode. But not so fast: there is another, secret anomaly, unknown to the ANSI committee, called G2-item (https://news.ycombinator.com/item?id=23500134). And in repeatable read mode, Postgres allows it (https://jepsen.io/analyses/postgresql-12.3). So you’ll have to check your precedence graphs for that one.
