
Architecture overview

Materialize was first built as a single binary that runs on a single node: materialized. To support mission-critical deployments at any scale, we are now evolving the binary into a cloud-native platform with built-in horizontal scaling, active replication and decoupled storage. To learn more about the future architecture of Materialize, sign up for early access.

The materialized binary

The materialized process (pronounced materialize-dee; the “d” is for daemon) interfaces with the outside world using:

- A SQL shell, for interacting with clients
- Sources, for ingesting data

Materialize deployment diagram

Above: Materialize deployed with multiple Kafka feeds as sources.

Below: Zooming in on Materialize’s internal structure in the above deployment.

Materialize internal diagram

SQL shell: interacting with clients

Right now, Materialize provides its interactive interface through psql running locally on a client machine; this uses the PostgreSQL wire protocol (pgwire) to communicate with materialized.

Because this is a SQL shell, you interact with your server by issuing SQL statements. These are sent over pgwire to an internal queue, where a sql thread dequeues and parses each statement.
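For example, assuming materialized is running locally on its default port (6875), you can connect with psql and confirm the round trip with a trivial statement:

```sql
-- From a terminal: psql -h localhost -p 6875 materialize
-- Any statement you type is sent over pgwire and handled by the sql thread:
SELECT 1;
```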

Broadly, there are three classes of statements in Materialize:

- Creating sources
- Reading data
- Creating views

Creating sources

When Materialize receives a CREATE SOURCE... statement, it connects to some external system to read data. You can find more information about how that works in the Sources section.
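As a sketch of what this can look like for a Kafka-backed source (the source name, broker address, topic, and schema registry URL below are placeholders, and the exact syntax depends on your Materialize version):

```sql
-- Ingest an Avro-encoded Kafka topic; Materialize reads new records as they arrive.
CREATE SOURCE purchases
FROM KAFKA BROKER 'kafka:9092' TOPIC 'purchases'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081';
```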

Reading data

As with any SQL API, you read data from Materialize using SELECT statements. When the sql thread parses a SELECT statement, it generates a plan. Plans in Materialize are dataflows, which can be executed by Differential. The plan is then passed to the dataflow package, which acts as the glue between Materialize and its internal Differential engine.
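If you want to see the plan Materialize generates for a query, you can ask for it with an EXPLAIN statement (the exact variants and output format vary by version; the purchases source here is the hypothetical one from above):

```sql
-- Show the plan for a query without executing it.
EXPLAIN PLAN FOR
    SELECT region, count(*) FROM purchases GROUP BY region;
```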

Differential then passes this new dataflow to all of its workers, which begin processing. Once Differential determines that the computation is complete, the results are passed back to the client, and the dataflow is terminated.
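For instance, an ad hoc query like the one below (again using the hypothetical purchases source) builds a dataflow, returns its result once, and then tears that dataflow down:

```sql
-- Computed once per execution; the supporting dataflow does not persist.
SELECT region, count(*) AS purchase_count
FROM purchases
GROUP BY region;
```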

Unfortunately, if the user passes the same query to Materialize again, it must repeat the entire process––creating a new dataflow, waiting for its execution, etc. The inefficiency of this is actually Materialize’s raison d’être, and leads us to the thing you actually want to do with the software: creating views.

Creating views

If you know that you are routinely interested in knowing the answer to a specific query (how many widgets were sold in Oklahoma today?), you can do something much smarter than repeatedly ask Materialize to tabulate the answer from a blank slate. Instead, you can create a materialized view of the query, which Materialize will persist and continually keep up to date.

When users define views (that is, CREATE MATERIALIZED VIEW some_view AS SELECT...), the internal SELECT statement is parsed––just as it is for ad hoc queries––but instead of only executing a single time, the generated dataflow persists. Then, as data comes in from Kafka, Differential workers collaborate to maintain the dataflow and its attendant view.
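Returning to the widget question above, here is a sketch of such a view (the purchases source and its item and state columns are hypothetical):

```sql
-- Materialize maintains this count incrementally as new records arrive.
CREATE MATERIALIZED VIEW oklahoma_widget_sales AS
    SELECT count(*) AS widgets_sold
    FROM purchases
    WHERE item = 'widget' AND state = 'OK';
```

From this point on, the view stays up to date as new purchase records stream in from Kafka.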

To read data from views (as opposed to ad hoc queries), users target the view with SELECT * FROM some_view; from here, Materialize can simply return the result from the already-up-to-date view. No substantive processing necessary.

Reading data vs. creating views

The difference between simply reading data and creating a view is how long the generated dataflow persists.

In some cases, when the queries are relatively simple, Materialize can avoid creating a new dataflow and instead serves the result from an existing dataflow.
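As a rough illustration (reusing the hypothetical objects above), a read that targets an existing materialized view can usually be answered by the dataflow already maintaining it, while a fresh query over a source generally requires a new, short-lived dataflow:

```sql
-- Likely served from the dataflow that already maintains the view:
SELECT * FROM oklahoma_widget_sales;

-- Likely requires building (and then tearing down) a new dataflow:
SELECT state, count(*) FROM purchases GROUP BY state;
```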

Sources: Ingesting data

For Materialize to ingest data, it must read that data from a source or a table. Sources come in two varieties:

- File sources, which read data from files
- Streaming sources, which read data from streams such as Kafka topics

File sources are more straightforward, so we’ll focus on streaming sources.

When using a streaming source, Materialize subscribes to Kafka topics and monitors the stream for data it should ingest.

As this data streams in, all Differential workers receive updates and determine which––if any––of their dataflows should process this new data. This works because each Differential worker determines the partitions it’s responsible for; the isolation that this self-election process provides prevents contention. Phrased another way, you don’t have to worry about two workers both trying to process the same piece of data.

The actual processing of Differential workers maintaining materialized views is also very interesting and one day we hope to explain it to you here. In the meantime, more curious readers can take the first step towards enlightenment themselves.

Implicit in this design are a few key points:

Learn more

Check out: