Architecture overview
Materialize was first built as a single binary that runs on a single node: `materialized`. To support mission-critical deployments at any scale, we are now evolving the binary into a cloud-native platform with built-in horizontal scaling, active replication, and decoupled storage. To learn more about the future architecture of Materialize, sign up for early access.
The materialized binary
The `materialized` process (pronounced materialize-dee; the “d” is for daemon) interfaces with the outside world using:
- SQL shells for interacting with clients, including defining sources, creating views, and querying data.
- Sources to ingest data, i.e. writes. We’ll focus on streaming sources like Kafka, though Materialize also supports file sources.
Above: Materialize deployed with multiple Kafka feeds as sources.
Below: Zooming in on Materialize’s internal structure in the above deployment.
SQL shell: interacting with clients
Right now, Materialize provides its interactive interface through `psql` running locally on a client machine; this uses the PostgreSQL wire protocol (`pgwire`) to communicate with `materialized`.
Because this is a SQL shell, Materialize lets you interact with your server through SQL statements sent over `pgwire` to an internal queue, where they are dequeued by a `sql` thread that parses the statement.
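For example, with a local `materialized` on its default port (6875 at the time of writing), a session might start like this; connection details vary by version:

```sql
-- Connect with any pgwire-speaking client, e.g.:
--   psql -h localhost -p 6875 materialize
-- Each statement you issue travels over pgwire into the internal queue,
-- where the sql thread dequeues and parses it.
SHOW SOURCES;
```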
Broadly, there are three classes of statements in Materialize:
- Creating sources to ingest data from an external system, which is how data gets inserted into Materialize
- Reading data from sources
- Creating views to maintain the output of some query
Creating sources
When Materialize receives a `CREATE SOURCE...` statement, it connects to some external system to read data. You can find more information about how that works in the Sources section.
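As a minimal sketch (the file and names are hypothetical, and the syntax reflects the single-binary version of Materialize; streaming sources are covered below):

```sql
-- Ingest a local CSV file with three columns; materialized connects to
-- the "external system" (here, the filesystem) and starts reading.
CREATE SOURCE widget_sales
FROM FILE '/tmp/widget_sales.csv'
FORMAT CSV WITH 3 COLUMNS;
```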
Reading data
As with any SQL API, you read data from Materialize using `SELECT` statements. When the `sql` thread parses a `SELECT` statement, it generates a plan. Plans in Materialize are dataflows, which can be executed by Differential. The plan gets passed to the `dataflow` package, which works as the glue between Materialize and its internal Differential engine.
Differential then passes this new dataflow to all of its workers, which begin processing. Once Differential determines that the computation is complete, the results are passed back to the client, and the dataflow is terminated.
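For instance, an ad hoc aggregation over the hypothetical `widget_sales` source from above (the column names assume the default CSV naming, which may differ across versions):

```sql
-- Each execution builds a brand-new dataflow, runs it to completion,
-- returns the answer, and then tears the dataflow down.
SELECT column1 AS state, count(*)
FROM widget_sales
GROUP BY column1;
```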
Unfortunately, if the user passes the same query to Materialize again, it must repeat the entire process: creating a new dataflow, waiting for its execution, and so on. Eliminating this inefficiency is Materialize’s raison d’être, and it leads us to the thing you actually want to do with the software: creating views.
Creating views
If you know that you are routinely interested in the answer to a specific query (how many widgets were sold in Oklahoma today?), you can do something much smarter than repeatedly asking Materialize to tabulate the answer from a blank slate. Instead, you can create a materialized view of the query, which Materialize will persist and continually keep up to date.
When users define views (that is, `CREATE MATERIALIZED VIEW some_view AS SELECT...`), the internal `SELECT` statement is parsed, just as it is for ad hoc queries, but instead of executing only a single time, the generated dataflow persists. Then, as data comes in from Kafka, Differential workers collaborate to maintain the dataflow and its attendant view.
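To make the Oklahoma example concrete, here is a sketch against the hypothetical `widget_sales` source from earlier, assuming its first CSV column holds the state:

```sql
-- The inner SELECT is parsed once, but its dataflow persists: as new
-- rows arrive, Differential workers update the counts in place.
CREATE MATERIALIZED VIEW widgets_by_state AS
    SELECT column1 AS state, count(*) AS widgets_sold
    FROM widget_sales
    GROUP BY column1;
```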
To read data from views (as opposed to ad hoc queries), users target the view with `SELECT * FROM some_view`; from here, Materialize can simply return the result from the already-up-to-date view. No substantive processing necessary.
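Reading from the hypothetical view sketched above is then just:

```sql
-- No new dataflow: Materialize answers from the maintained view.
SELECT * FROM widgets_by_state;
```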
Reading data vs. creating views
The difference between simply reading data and creating a view is how long the generated dataflow persists.
- When you perform an ad hoc `SELECT`, the dataflow only sticks around long enough to generate an answer once before being terminated.
- For views, the dataflows persist indefinitely and are kept up to date. This means you can get an answer an indefinite number of times from the generated dataflow. Though getting these answers isn’t “free” because of the incremental work required to maintain the dataflow, it is dramatically faster to perform a read on this data at any given time than it is to create an answer from scratch.
In some cases, when the query is relatively simple, Materialize can avoid creating a new dataflow altogether and instead serve the result from an existing dataflow.
Sources: Ingesting data
For Materialize to ingest data, it must read it from a source or a table. Sources come in two varieties:
- Streaming sources, like Kafka
- File sources, like `.csv` or generic log files
File sources are more straightforward, so we’ll focus on streaming sources.
When using a streaming source, Materialize subscribes to Kafka topics and monitors the stream for data it should ingest.
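For example, a sketch in the syntax of this single-binary era of Materialize (the broker, topic, and schema registry URL are all hypothetical; check the docs for your version):

```sql
-- Subscribe to a Kafka topic and decode each record with the Avro
-- schema fetched from the Confluent schema registry.
CREATE SOURCE widget_orders
FROM KAFKA BROKER 'localhost:9092' TOPIC 'widget_orders'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://localhost:8081';
```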
As this data streams in, all Differential workers receive updates and determine which––if any––of their dataflows should process this new data. This works because each Differential worker determines the partitions it’s responsible for; the isolation that this self-election process provides prevents contention. Phrased another way, you don’t have to worry about two workers both trying to process the same piece of data.
The actual processing that Differential workers perform to maintain materialized views is also very interesting, and one day we hope to explain it here. In the meantime, more curious readers can take the first step toward enlightenment themselves.
Implicit in this design are a few key points:
- State is all in totally volatile memory; if `materialized` dies, so does all of the data.
- Streaming sources must receive all of their data from the stream itself; there is no way to “seed” a streaming source with static data. However, you can union streaming and file sources in views (sketched below), which accomplishes a similar outcome.
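As a sketch of that union pattern, assuming the hypothetical streaming source `widget_orders` and file source `widget_sales` from earlier have union-compatible columns:

```sql
-- Historical rows come from the file; new rows keep arriving from Kafka.
CREATE MATERIALIZED VIEW all_widget_events AS
    SELECT * FROM widget_sales
    UNION ALL
    SELECT * FROM widget_orders;
```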
Learn more
Check out:
- Key Concepts to better understand Materialize’s API
- Get started to try out Materialize