View Maintenance: A New Approach to Data Processing

February 24, 2020

Frank McSherry

Chief Scientist

It's with a great deal of excitement, and some trepidation, that we are now able to show off what we have been working on at Materialize.

The Materialize product is a "streaming data warehouse", which is a class of product you might not be familiar with as, to the best of our knowledge, there wasn't really such a thing until now. In this post we'll go through what a streaming data warehouse entails, why you might be fascinated by one (perhaps to use, perhaps to understand), and what sorts of interesting work goes in to making such a thing. There will be a variety of follow-on blog posts from yours truly, and others at Materialize!

This blog post is meant to pique your curiosity, and to get you to start thinking of all the new and exciting things you or your business could do with this tool.

Some historical context

To understand where Materialize fits in to the data ecosystem, let's first take a quick tour through the traditional categories in the data management and processing space. We'll visit traditional transaction processors (OLTP) and analytic processors (OLAP), and then introduce what we think is an interesting twist: incremental view maintenance (OLVM).

Transaction processing (OLTP)

Transaction processing is the beating heart of most databases. One describes a collection of tables, their schemas, and then allows users to fire off requests that add to, remove from, and generally update the records in these tables. These updates can be done transactionally, so that groups either all commit or do not, so as to ensure various invariants are maintained about the data (often: that the data appear as if interacted with through a sequence of actions, rather than willy nilly).

The archetypical example of transaction processing is a bank, where you have people with various accounts, and you'd like to transfer funds from one account to another. This involves incrementing the balance of one account, decrementing the balance of another account, and doing either both or neither and only if the decrement would not go negative. You can probably see that trying to do many of these at the same time has the potential to conflict, and perhaps that doing transaction processing well requires careful thought and design.

At the same time, there are tasks that aren't quite as intricately tangled up. If you just want to read your balance out, or if the bank wants an accounting of the distribution of balances across its customers, or wants to perform broader analysis of the transactions it has processed, you should be able to do this without being on the critical path of transaction processing itself.

Analytic processing (OLAP)

Analytic processing captures a wide class of tasks where one wants to read from the source-of-truth data, and perform broader computations across all of the data. These queries often involve full scans of some relation, joins with other relations, and aggregation and accumulation of the results by certain key columns. First, the work itself is now substantial enough that we can't easily fit it in to the standard transaction processor, and need instead a new tool specifically designed to handle this form of data processing. Second, the optimal physical layout and data structures for this work is very different from what is best for transaction processing.

As an example, one might want to determine the amount of money transferred between countries, by summing across all transfers and using the country on record for the sender and receiver. To answer this question an analytic processor needs to be ready to scan all transactions, and have random access into indexed customer data to pull out the source and destination countries. Different queries might require other indexes, other subsets of columns, and generally thoughtful layout of data with processing in mind.

While analytic processors greatly improve the throughput of analytic queries, they still tackle each problem from scratch. For tasks that involve monitoring, dashboards, or continual analytics, this type of approach quickly saturates: just re-reading the data takes some minimal amount of time, and doing so for all of the queries of all of the users occupies even a very efficient analytic processor.

View maintenance (OLVM) (NEW!)

View maintenance is the task of updating the results of queries (possibly analytic queries) when the underlying data change. Although OLTP and OLAP systems support views, they are not architected to efficiently maintain these views as the data change. Systems designed for view maintenance can often handle substantially higher load for workloads that re-issue the same questions against changing data: they perform work proportional to the volume of changes in the source data, rather than in proportion to the number of times the results need to be inspected.

To take the example above, imagine a bank wanted to monitor the amount of money moving between countries, with information that is as fresh as possible (to notice abnormalities as soon as possible). The same query as above can be implemented to respond to changes to the inputs (transactions, and customer addresses) and produce changes in the totals that move between countries. Only when changes occur is work performed, and the current answers can simply be read out from the maintained results whenever needed.

View maintenance is a very powerful approach for an increasingly popular class of problems. It does not replace a transaction processor (it usually sits just downstream from one) nor an analytic processor (which is optimized for and excels at exploratory queries). However, if you need a continually fresh view of your data, and analytics that derive from your data, a view maintenance engine is the tool that will get you there.

Streaming data warehouses

Imagine you had a data warehouse that reacted to streams of changes; what could you do with it?

We've been repeatedly surprised by use cases that we did not anticipate, surfaced by real customers. We'll call out a few here that we've prepared as stand-alone demos, each of which we'll talk about in upcoming posts.

Interactive Business Intelligence

Products like Tableau and Looker, and more recently Metabase and Apache Superset, provide business users an interactive experience that helps them through the process of forming queries and digesting the results. They automatically determine statistics for columns, and propose thoughtful visualizations that distill down large volumes of data into tractable aggregates. The data are then presented back to the analyst through interactive dashboards.

To date, these products are all backed by OLAP engines, which repeatedly poll the database with their questions and populate the dashboards. As time goes on the dashboards refresh, but with the speed (and frequency) of the backing OLAP engine and its ETL pipeline.

In principle, you could replace the OLAP engine with a OLVM engine, and keep the dashboards as fresh as you like!

We did this with Metabase, though we had to tweak a few parameters to get it to refresh more frequently than once a minute (it lets us go down to once a second, but gets worried on your behalf at that point, assuming that the OLAP engine can't go any faster). We cooked up a demo based on ch-benchmark which hybridizes TPC-C transaction processing and TPC-H analytic processing workloads, and which uses Metabase to explore the base tables and analytic queries, and watch as they are maintained.

Real-time microservices

Microservice architectures have become all the rage for dealing with ad-hoc computation across streams of data. You write a bit of code that does some scope-bounded work or transformation, and then either have it read from an event hub (e.g. Kafka) or listen for explicit requests.

Unfortunately, "rage" is a super-appropriate term. While writing your first microservice does seem pretty sweet, keeping a fleet of them coordinated, consistent, durable, monitored, and such can be a software engineering nightmare. Transformations that join multiple sources of information are less elegantly expressed, and sharing and optimization across multiple microservices is essentially a non-starter.

Some large fraction of these microservices can be expressed as views over consistently changing input sources.

We've put together a microservice demo which collects high-volume usage information as well as evolving pricing information, joins the two, and presents the results as an accounting for each customer. Rather than a stack of microservices, this is all expressed with SQL, and is very easy to modify or extend.

Real-time log processing

Anyone who has worked with popular websites or complex systems understands the value in log processing. You might use Elasticsearch or Splunk to grind through and index your logs so that you can view them from a different direction, and take home some conclusion that isn't as immediate from the raw terabytes.

At the same time, these tools have limits, and there are times you want to see a thing they don't easily present.

Logs are, in essence, a continually evolving set of facts about your underlying systems, and maintaining indexed views and derived computation is exactly the sort of thing that OLVM does best.

We've put together a demo that hammers a web server, and attached Materialize to the generated logs. The logs are just comma-separated values, and the demo shows off how to turn these into relationally structured records, and then to determine and monitor things like the number of unique IP addresses that have visited the server, the number of pages viewed by each searcher, things like that. All expressed as vanilla SQL.

What's under the covers?

There is a lot of technical meat underneath Materialize that we'll want to talk through. And the plan is to talk through as many interesting parts as you have time to read about. Just to give you a taste of the sorts of things that make OLVM a fundamentally different sort of problem than your standard database or data processor, consider these:

Queries need to be dataflows

To compute and then maintain a query, we need to express it as a differential dataflow computation. We don't have the option to bail out and just compute the query a different way when it is hard to do it as dataflow. That means that things like SQL subqueries, order by / limit statements, and various flavors of inner, outer, and lateral joins all need to be expressed using dataflow idioms. Fortunately, they can be! But it takes some work.

There is a bunch of interesting thought that goes in to how to plan SQL idioms that were not intended so much for efficient execution as to absorb the cognitive burden of users. Subquery decorrelation, for example, requires a fair bit of unpacking to render as a dataflow of relational joins, and doing it both completely and well is non-trivial.

Control flow interruption is hard

Dataflows are designed to move information around and process batches of it at a time. They are not especially well equipped to handle fine-grained control flow interruptions as happen when you divide by a zero. Or overflow some addition. Or supply data that transiently violates invariants like unique and foreign key, or non-nullability constraints.

Instead, Materialize needs to not fall over just because the combination of your data and your query are in a bad space. There are transformations to pull here, that harden the computation and propagate "query errored" states that the user can see instead of their query results. Delightfully, if we do this well, the user can correct their data and return to observing the correct answers.

Query optimization is different

Queries in a traditional OLTP or OLAP processor are mostly about minimizing the time to complete the query. The faster this happens the more likely a transaction is to commit, and the more queries per second the system can support. An OLVM system has different goals: data updates need to retire with high throughput, while using as little memory as possible. To see where these are different, consider how easy it can be for an OLAP system to determine the Top K records in some view (a scan) as opposed to how challenging it can be for an OLVM system to maintain the Top K records under arbitrary changes to the inputs.

Query optimization in an OLVM system has different goals, and different constraints. Plans need to be both efficient to compute initially and efficient to maintain incrementally. Our Top-K query fragments, for instance, perform hierarchical aggregation, which bounds the update cost for arbitrary changes. Our join planning re-uses in-memory assets to minimize the incremental memory cost of each new query.

Wrapping up

We at Materialize are really excited to be opening up the doors on this new type of data processor. View Maintenance doesn't replace your existing data infrastructure, but it is such a powerful new addition that we think people will both want and need to reconsider what is possible. We will have a regular stream of technical deep-dives into what we've done, both inside and outside the Materialize product, and we both hope and plan to excite you!

Register for a Materialize account here to get started, sign up below for stimulating recurring email content, check out the documentation, or just rush over and grab a copy of the code itself to dig into the internals!