Clusters, explained with Data Warehouses
If you're familiar with data warehouses, this article will help you understand Materialize Clusters in relation to well-known components in Snowflake.
Materialize is a streaming database used by teams to serve continually-updated data transformations using the same SQL workflows used with batch processing. To see if it works for your use case, register for access here.
In 2022, we took Materialize from a single binary to a distributed cloud platform. At the heart of this update is the separation of storage and compute. This is an exciting change! Being able to scale on demand to any number of concurrent workloads helps teams focus on their use cases without needing to make complex infrastructure decisions. We've already seen how the separation of storage and compute helped with the adoption of cloud data warehouses like Snowflake.
Before tools like Snowflake, managing a data warehouse was primarily the job of a DBA tasked with constantly tuning and ensuring the health of the system. Many of those concerns have now been abstracted away. Without having to worry about the internals of the system, a range of roles can freely develop on top of the warehouse.
Materialize has done the same for streaming: Using best practices from data warehouses, streaming applications no longer have to be limited to a specialized group of engineers. Instead, multiple teams can safely build real-time applications without worrying about impacting other workflows.
If you are familiar with data warehouses, most of that knowledge will carry over as we discuss how best to build applications with Materialize. Here's a high-level diagram of Materialize storage, compute and control layers compared with Snowflake's:
In the rest of this article, we'll explain key components of Materialize architecture by comparing them to familiar parts of Snowflake.
Let’s start with the compute layer. In Snowflake, this is where all SQL queries are processed as they are received. In Materialize, compute is responsible for that and more, specifically:
- Streaming data in and out of Materialize - Sources and Sinks, which we’ll discuss more below.
- Running continuous transformations on data - SQL Views
- Maintaining in-memory results - Indexes
This is a lot of work! We already discussed how separating compute and storage helps reduce operational burden, but even within compute we need to be able to isolate resources from each other as we scale. To keep this organized and easy to use, we need a single logical model for the isolation within compute. We accomplish this with clusters.
Clusters are the logical components that express resource isolation. This may sound similar to virtual warehouses in Snowflake. That is intentional: both offer the ability to scale compute resources separately from the storage layer. To be precise, a cluster with one replica is the Materialize equivalent of a virtual warehouse in Snowflake. We'll talk more about replicas soon.
In Snowflake, you create multiple virtual warehouses for different processes. Within your account, you might create a virtual warehouse for use by your ETL tool, another warehouse for ad hoc queries and a third for data science applications.
Isolating your virtual warehouses helps ensure that no single process interferes with the performance of other tasks or teams. It also allows you to tailor resources to the needs of each process and to separate costs. An ETL warehouse that runs small copy queries throughout the day may have very different needs than one that runs long queries to train a recommendations model.
Mapping our Snowflake example onto Materialize, we can use clusters instead of virtual warehouses. In our example, we isolate our ETL, business intelligence and data science workflows into their own clusters.
This gives us an ideal balance of isolation and configurability:
- Each cluster can be sized based on the needs of the workload
- There is no concern of a rogue change in BI taking down the feature store in DS
- But the low-level transformation work needed by both BI and DS can be efficiently consolidated to the ETL cluster.
We keep referring to "a cluster with one replica" so now is a good time to explain what a replica is. Remember, clusters are just the logical component. Replicas are the corresponding physical component. They contain the physical allocations of resources (CPU and Memory) and actually do all the computation work.
Why not just use a single component like a virtual warehouse? There's a good reason which we'll get to in the next section. But, for now, it's safe to think of "a cluster with one replica" as being the most widely used configuration of compute in the Materialize world, similar to a virtual warehouse in Snowflake.
Replicas have a size (up to 6xlarge) indicating the resources (i.e., CPU and memory) allocated to them.
Each increase in size represents a doubling of resources.
Picking the correct size for your replicas depends on the complexity and number of views maintained within the cluster.
Clusters with multiple Replicas
At this point, we have isolated our clusters and allocated physical resources. The last thing we need to cover is clusters with multiple replicas.
Note: Clusters and replicas have no similarity to multi-cluster virtual warehouses in Snowflake. (In Snowflake, multi-cluster warehouses are used primarily to handle queries that are queued and waiting to be executed against the warehouse, which isn't applicable in Materialize.)
Materialize uses active replication, so as we add multiple replicas, all of them will perform the same work within the cluster.
Replication for high availability
Running a cluster with multiple replicas gives greater tolerance to hardware failure and helps ensure our dataflows are as up-to-date as possible.
Replication for scaling up and down
Materialize also allows clusters to have replicas of different sizes. This allows you to scale active clusters without any downtime. For example, if we continued to add new materialized views to the BI cluster, we could increase the cluster resources by attaching a larger replica. After the new replica has finished catching up to the original, we can remove the original.
In Materialize, SQL transformations run continuously, so it's useful to have a way to scale up and down with minimal disruption.
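The resize sequence described above (attach a larger replica, wait for it to catch up, then remove the original) can be sketched as a toy model. This is not Materialize code: the `Replica` and `Cluster` classes, the `caught_up` flag, and the `resize_without_downtime` helper are all hypothetical names invented for illustration. The sketch only shows why the cluster can keep serving at every step.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    size: str
    caught_up: bool = False  # True once this replica's results are current


@dataclass
class Cluster:
    name: str
    replicas: dict = field(default_factory=dict)  # replica name -> Replica

    def can_serve(self) -> bool:
        # With active replication every replica does the same work,
        # so a single caught-up replica is enough to answer queries.
        return any(r.caught_up for r in self.replicas.values())


def resize_without_downtime(cluster, old_name, new_name, new_size):
    """Swap one replica for another size; return (step, can_serve) pairs."""
    log = []
    # Step 1: attach the new replica alongside the original.
    cluster.replicas[new_name] = Replica(size=new_size)
    log.append(("attach new replica", cluster.can_serve()))
    # Step 2: stand-in for waiting until the new replica has caught up.
    cluster.replicas[new_name].caught_up = True
    log.append(("new replica caught up", cluster.can_serve()))
    # Step 3: only now is it safe to remove the original.
    del cluster.replicas[old_name]
    log.append(("drop original replica", cluster.can_serve()))
    return log


# Hypothetical BI cluster that starts with one small replica.
bi = Cluster("bi", {"r1": Replica(size="small", caught_up=True)})
steps = resize_without_downtime(bi, "r1", "r2", "large")
```

The ordering is the point of the sketch: dropping the original before the new replica has caught up would leave the cluster with no replica able to serve current results.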
Materialize is meant for actively changing data: users want access to their calculations and aggregations as soon as new data comes in. Most of this data comes in via sources, which read Kafka topics or CDC events from databases.
If you're familiar with Snowflake's Snowpipe Kafka Service, Materialize Sources are doing a similar job: both services reach into customer infrastructure and pull data into the database.
Snowpipe runs separately from virtual warehouses. It streams data from Kafka and batches messages into bulk updates to raw tables in the storage layer.
In Materialize, a Source runs within a cluster. It consumes data from Kafka or Postgres, handles upsert logic and some initial formatting, and continually writes updates to storage.
Besides the obvious stream vs batch differences, the change in architecture from Snowpipe to Materialize Sources has a few important implications:
- Sources are not billed separately - Snowpipe costs are billed separately from virtual warehouses. Sources run on Clusters, which are billed via compute credits.
- Admins can decide how best to arrange Sources on Clusters - Each Source can be isolated on its own Cluster, or several Sources can share a Cluster to save money.
- Sizing Clusters for Sources is important - See below.
Sizing Clusters for Sources
Selecting the correct size for the cluster replica(s) running your Sources depends on the throughput and envelope type of your data. Sources that ingest a lot of events, or that upsert data as opposed to appending it, are more intensive and require more CPU and memory.
The recommended workflow for sizing a new source is to first create the source on a large cluster replica and check the resource utilization graphs to see whether the replica can be sized down. If a Source is only using 20% of the resources on a large, it can be sized down two levels to a small. (Keep in mind that available resources double with each size.)
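The arithmetic behind that recommendation can be sketched in a few lines. The sketch assumes utilization scales linearly with resources, so each step down doubles it. It also assumes a three-step ladder in which the size between small and large is called medium, and a 100% ceiling. Real workloads need headroom, so a lower ceiling is usually wiser. All function names are hypothetical.

```python
# Assumed size ladder for illustration; each step up doubles resources.
# "medium" is an assumption: the text only says large is two levels above small.
SIZES = ["small", "medium", "large"]


def projected_utilization(current_pct: float, levels_down: int) -> float:
    """Utilization after sizing down: halving resources doubles utilization."""
    return current_pct * 2 ** levels_down


def max_levels_down(current_pct: float, ceiling_pct: float = 100.0) -> int:
    """Largest number of size-downs that keeps utilization at or under the ceiling."""
    if current_pct <= 0:
        raise ValueError("utilization must be positive")
    levels = 0
    while projected_utilization(current_pct, levels + 1) <= ceiling_pct:
        levels += 1
    return levels


def recommend_size(current_size: str, current_pct: float,
                   ceiling_pct: float = 100.0) -> str:
    """Smallest size on the assumed ladder that still fits the workload."""
    idx = SIZES.index(current_size)
    # Never step below the smallest size on the ladder.
    steps = min(max_levels_down(current_pct, ceiling_pct), idx)
    return SIZES[idx - steps]
```

For the example in the text, `recommend_size("large", 20)` steps through 40% on medium and 80% on small, and a third step would exceed the ceiling, so it stops at small.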
Also worth mentioning: as in Snowflake, Materialize sets no limit on storage. The Source running in the compute layer handles the transport of data, but the data itself is written to S3. All the data we bring in is persisted in one place without us having to worry about allocating space or determining where specific sources should be stored. Having all our data in one place also allows us to bring different compute resources to the data.
Modern data teams need tools that scale. No one wants to re-architect their system for an unexpected uptick in traffic. But scaling should not be a technical burden. Luckily, Materialize allows for building real-time data products with the same flexibility and resilience that people have come to expect from data warehouses. Using configurable and limitless storage and compute primitives on Materialize, teams can build and grow their streaming infrastructure in a way that matches their needs.