Materialize's unbundled cloud architecture

May 6, 2022

Frank McSherry

Chief Scientist

Materialize: Phase 2

It's been a while since we last told you what we at Materialize are up to. You might have thought "oh, probably more of the same: fast database stuff". As it turns out, you aren't wrong, but we still think you'll be surprised.

For the past three years we've focused on building Materialize as a single binary. That binary interactively serves and incrementally maintains SQL queries really well. It does it so well, in fact, that user demand is pushing us beyond the limits of our current architecture. For that reason, our entire team is working on shipping our biggest change to date: unbundling our binary into a cloud-native platform built out of infinitely scalable primitives.

Starting in September, Materialize is going horizontal.

Unbounded Scale

It makes sense that when investing in a platform, you don't want to discover scaling barriers.

You want it to support unbounded numbers of users and sessions.
You want it to support unbounded numbers of data sources, with unbounded volumes and rates.
You want it to support unbounded numbers of views over these data.

So we figured we'd do that.

We're doing the same thing that other smart people have done: "separating storage and compute". Smart people have learned that if you decouple the storage of data from the compute acting on the data, each part can scale independently of the other. New data sources can spill into cloud storage without disrupting your existing installations. New use cases can invoke new, isolated compute resources without impacting existing workloads. If you ever need more of a thing, you can get it without interrupting anyone else.

What's new here is that smart people have primarily done this for batch analytics. We're doing it for SQL queries that are interactively served and incrementally maintained.

Architecture

To remove the limits mentioned above, we've restructured Materialize's internal architecture. There is a lot to say about this, but let's start with just a sketch.

Materialize is based around a data model of explicitly timestamped changelogs of collections.

All inputs are first turned into these changelogs, and are durably recorded.
All views translate these changelogs into exactly corresponding output changelogs.
All queries are performed against such changelogs at specific times.

This data model gives us confidence that we are producing correct answers to specific questions.
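To make the data model concrete, here is a minimal sketch of a timestamped changelog, a view over it, and a query at a specific time. It is an illustration under stated assumptions, not Materialize's implementation: the `Update` triple, `filter_view`, and `snapshot` are hypothetical names, and the `orders` data is made up.

```python
from collections import Counter
from typing import Callable, Hashable, List, NamedTuple


class Update(NamedTuple):
    """One changelog entry (hypothetical representation)."""
    data: Hashable  # a record in the collection
    time: int       # explicit logical timestamp of the change
    diff: int       # +1 for an insertion, -1 for a deletion


def filter_view(changelog: List[Update],
                predicate: Callable[[Hashable], bool]) -> List[Update]:
    """Translate an input changelog into the changelog of a filtered view."""
    return [u for u in changelog if predicate(u.data)]


def snapshot(changelog: List[Update], as_of: int) -> dict:
    """Accumulate every change at or before `as_of` into collection contents."""
    contents = Counter()
    for u in changelog:
        if u.time <= as_of:
            contents[u.data] += u.diff
    return {d: n for d, n in contents.items() if n != 0}


# Made-up input: bob's order changes from 5 to 50 at time 2.
orders = [
    Update(("alice", 30), time=1, diff=+1),
    Update(("bob", 5), time=1, diff=+1),
    Update(("bob", 5), time=2, diff=-1),
    Update(("bob", 50), time=2, diff=+1),
]

# The view's changelog carries the same timestamps as its input.
large_orders = filter_view(orders, lambda order: order[1] >= 10)

at_1 = snapshot(large_orders, as_of=1)
at_2 = snapshot(large_orders, as_of=2)
```

In this sketch, reading the view's changelog at a time gives the same result as applying the filter to the input as of that time, which is the sense in which the output changelog "exactly corresponds" to the input.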

However, our data model also allows us to unbundle Materialize's architecture. Ingestion, computation, and querying can each be performed and scaled independently. The explicit, durable timestamps ensure we provide consistent answers even across independent components.
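A small sketch of why explicit timestamps help independent components agree. Everything here is an assumption for illustration: `durable_log` stands in for durably recorded input, and `total_rows` and `distinct_rows` stand in for two compute components that never talk to each other.

```python
from collections import Counter

# Hypothetical durable changelog of (data, time, diff) triples.
durable_log = [
    ("apple", 1, +1),
    ("pear", 1, +1),
    ("apple", 2, +1),
    ("pear", 3, -1),
]


def contents_as_of(log, as_of):
    """Accumulate all changes with time <= as_of into record counts."""
    acc = Counter()
    for data, time, diff in log:
        if time <= as_of:
            acc[data] += diff
    # Keep only records that are still present at `as_of`.
    return {d: n for d, n in acc.items() if n > 0}


def total_rows(log, as_of):
    """Component A: a view counting all rows."""
    return sum(contents_as_of(log, as_of).values())


def distinct_rows(log, as_of):
    """Component B: a view counting distinct rows."""
    return len(contents_as_of(log, as_of))


# Each component answers on its own, but for the same timestamp both
# answers describe the same state of the input.
answers = {
    t: (total_rows(durable_log, t), distinct_rows(durable_log, t))
    for t in (1, 2, 3)
}
```

Because each answer is tied to a timestamp rather than to "whatever the component has seen so far", the two components can be scaled or restarted separately without their results drifting apart.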

There are a lot of other great features that come online when you lean into this data model. We are absolutely going to talk you through all of them.

Timeline

You may have a pile of technical questions, which is totally fair. We'll have a pile of technical details coming up soon. The code is actually public, so you can follow along (and perhaps you have been for the past few months that we've been working on it).

We're not deploying or supporting the new horizontal architecture yet, but it should be available soon. The intended experience is essentially identical to the current Materialize, except that your sources and views are backed by an elastic set of resources. There is one new fundamental concept, the CLUSTER, which represents a co-located set of in-memory indexed data assets. Clusters are isolated from one another, both in performance and in faults. Otherwise, you still just use SQL and get your answers back quickly.

I'm more excited than I can clearly communicate.

Frank McSherry

Chief Scientist, Materialize

Frank was previously at Microsoft Research Silicon Valley, where he co-invented Differential Privacy and subsequently led the Naiad project. Frank holds a Ph.D. in Computer Science from the University of Washington.
