Quickstart with Materialize

This guide walks you through getting started with Materialize, including:

  • Connecting a streaming data source

  • Computing real-time results with indexes and materialized views

  • Simulating failure to see active replication in action

  • Scaling up or down based on computational needs

Before you begin

Materialize is wire-compatible with PostgreSQL, which means it integrates with most SQL clients and other third-party tools that support PostgreSQL.

In this guide, you’ll use psql to interact with Materialize, so make sure you have it installed locally.

On macOS, use Homebrew to install libpq and symlink the psql binary to /usr/local/bin:

brew install libpq
brew link --force libpq

On Linux, use apt to install the postgresql-client package:

sudo apt-get update
sudo apt-get install postgresql-client

On Windows, download and install the PostgreSQL installer certified by EDB.

Step 1. Start a free trial

With a free trial, you get 14 days of free access to Materialize resources worth up to 4 credits per hour. This limit should accommodate most trial scenarios. For more details, see Free trial FAQs.

NOTE: If you already have a Materialize account, skip to the next step.

  1. Sign up for a Materialize trial using your business email address.

  2. Activate your Materialize account.

    Once your account is ready, you’ll receive an email from Materialize asking you to activate your account. In the process, you’ll create credentials for logging into the Materialize UI.

Step 2. Prepare your environment

In Materialize, a cluster is an isolated environment, similar to a virtual warehouse in Snowflake. Within a cluster, you have replicas, which are the physical resources for doing computational work. Clusters are completely isolated from each other, so replicas can be sized based on the specific task of the cluster, whether that is ingesting data from a source, computing always-up-to-date query results, serving results to clients, or a combination.

For this guide, you’ll create 2 clusters, one for ingesting source data and the other for computing and serving query results. Each cluster will contain a single replica at first (you’ll explore the value of adding replicas later).

  1. In the Materialize UI, enable the region where you want to run Materialize.

    Region setup will take a few minutes.

  2. On the Connect screen, create a new app password and then copy the psql command.

    The app password will be displayed only once, so be sure to copy the password somewhere safe. If you forget your password, you can create a new one.

  3. Open a new terminal window, run the psql command, and enter your app password.

    In the SQL shell, you’ll be connected to a pre-installed default cluster from which you can get started.
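
    The psql command you copied will look something like the following, where the user and host shown here are placeholders specific to your account and region:

    psql "postgres://<user>@<host>:6875/materialize"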

  4. Use the CREATE CLUSTER command to create two new clusters, each with a single replica:

    CREATE CLUSTER ingest_qck REPLICAS (r1 (SIZE = '2xsmall'));
    
    CREATE CLUSTER compute_qck REPLICAS (r1 (SIZE = '2xsmall'));
    

    The 2xsmall replica size is sufficient for the data ingestion and computation in this getting started scenario.

  5. Use the SHOW CLUSTER REPLICAS command to check the status of the replicas:

    SHOW CLUSTER REPLICAS WHERE cluster IN ('compute_qck', 'ingest_qck');
    

       cluster   | replica |  size   | ready
    -------------+---------+---------+-------
     compute_qck | r1      | 2xsmall | t
     ingest_qck  | r1      | 2xsmall | t
    (2 rows)
    

    Once both replicas are ready (ready=t), move on to the next step.

Step 3. Ingest streaming data

Materialize supports streaming data from multiple external sources, including Kafka and PostgreSQL. The process for integrating a source typically involves configuring the source’s network and creating connection objects in Materialize.

For this guide, you’ll use a built-in load generator that simulates an auction house, where users bid on an ongoing series of auctions.

  1. Most objects in Materialize are namespaced by database and schema, including sources, so start by creating a unique schema within the default materialize database:

    CREATE SCHEMA qck;
    
    SET search_path = qck;
    
  2. Use the CREATE SOURCE command to create the auction house source:

    CREATE SOURCE auction_house
      IN CLUSTER ingest_qck
      FROM LOAD GENERATOR AUCTION
      FOR ALL TABLES;
    

    Note that the IN CLUSTER clause attaches this source to the existing ingest_qck cluster, but it’s also possible to create a cluster and replica at the time of source creation using the WITH SIZE option.
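
    For illustration only (you don’t need to run this for the guide), a source with its own automatically provisioned resources would look roughly like the following; the auction_house_sized name is hypothetical:

    CREATE SOURCE auction_house_sized
      FROM LOAD GENERATOR AUCTION
      FOR ALL TABLES
      WITH (SIZE = '2xsmall');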

  3. Now that you’ve created a source, Materialize starts ingesting data into durable storage, automatically splitting the stream into multiple subsources that represent different tables. Use the SHOW SOURCES command to get an idea of the data being generated:

    SHOW SOURCES;
    

            name            |      type      | size
    ------------------------+----------------+------
     accounts               | subsource      |
     auction_house          | load-generator |
     auction_house_progress | subsource      |
     auctions               | subsource      |
     bids                   | subsource      |
     organizations          | subsource      |
     users                  | subsource      |
    (7 rows)
    

    In addition to the auction_house load generator source and its subsources, you’ll see auction_house_progress, which Materialize creates so you can monitor source ingestion.

  4. Before moving on, look at the schema of the source data you’ll be working with:

    SHOW COLUMNS FROM auctions;
    

       name   | nullable |           type
    ----------+----------+--------------------------
     id       | f        | bigint
     seller   | f        | bigint
     item     | f        | text
     end_time | f        | timestamp with time zone
    (4 rows)
    
    SHOW COLUMNS FROM bids;
    

        name    | nullable |           type
    ------------+----------+--------------------------
     id         | f        | bigint
     buyer      | f        | bigint
     auction_id | f        | bigint
     amount     | f        | integer
     bid_time   | f        | timestamp with time zone
    (5 rows)
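
    If you’d like to eyeball a few raw rows before building views, an ordinary bounded query also works against subsources (your rows will differ, since data is continually generated):

    SELECT * FROM bids LIMIT 5;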
    

Step 4. Compute real-time results

With auction data streaming in, you can now explore the unique value of Materialize: computing real-time results over fast-changing data.

  1. Switch to your compute_qck cluster:

    SET CLUSTER = compute_qck;
    
  2. Enable psql’s timing feature so you can see how quickly results are returned:

    \timing
    
  3. First, create a view:

    CREATE VIEW avg_bids AS
      SELECT auctions.item, avg(bids.amount) AS average_bid
      FROM bids
      JOIN auctions ON bids.auction_id = auctions.id
      WHERE bids.bid_time < auctions.end_time
      GROUP BY auctions.item;
    

    This view joins data from auctions and bids to get the average price of bids that arrived before their auctions closed.

    Note that, as in other SQL databases, a view in Materialize is simply an alias for the embedded SELECT statement. Materialize computes the results of the query only when the view is queried.

  4. Query the view a few times:

    SELECT * FROM avg_bids;
    

            item        |    average_bid
    --------------------+-------------------
     Custom Art         | 50.10550599815441
     Gift Basket        | 50.51195882531032
     City Bar Crawl     | 50.02785145888594
     Best Pizza in Town | 50.20555741546703
     Signed Memorabilia | 49.34376098418278
    (5 rows)
    
    Time: 738.987 ms
    
    SELECT * FROM avg_bids;
    

            item        |    average_bid
    --------------------+--------------------
     Custom Art         | 50.135432589422194
     Gift Basket        | 50.485373134328356
     City Bar Crawl     |  50.03637566137566
     Best Pizza in Town |  50.16190159574468
     Signed Memorabilia | 49.354624781849914
    (5 rows)
    
    Time: 707.403 ms
    

    You’ll see the average bid change as new auction data streams into Materialize. However, the view retrieves data from durable storage and computes results at query-time, so latency is high and would be much higher with a production dataset.
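
    If you’re curious how Materialize plans an ad hoc query like this, EXPLAIN works much as in PostgreSQL (output omitted here; it varies by version):

    EXPLAIN SELECT * FROM avg_bids;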

  5. Next, create an index on the view:

    CREATE INDEX avg_bids_idx ON avg_bids (item);
    

    🚀🚀🚀 This is where Materialize becomes a true streaming database. When you use an index, Materialize incrementally computes the results of the indexed query in memory as new data arrives.
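
    To confirm the index exists and see which cluster maintains it, you can list the indexes on the view (exact output columns vary by version):

    SHOW INDEXES ON avg_bids;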

  6. Query the view again:

    SELECT * FROM avg_bids;
    

            item        |    average_bid
    --------------------+--------------------
     Custom Art         | 49.783986655546286
     Gift Basket        |  49.93436483689761
     City Bar Crawl     |  49.93733653007847
     Best Pizza in Town |  50.43617136074242
     Signed Memorabilia |  50.09202958093673
    (5 rows)
    
    Time: 26.403 ms
    

    You’ll see the average bids continue to change, but now that the view is indexed and results are pre-computed and stored in memory, latency is down to 26 milliseconds!
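
    Rather than polling with SELECT, you can also stream every change to the indexed results using SUBSCRIBE. In psql, wrap it in COPY and press Ctrl+C to stop:

    COPY (SUBSCRIBE avg_bids) TO STDOUT;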

  7. One thing to note about indexes is that they exist only in the cluster where they are created. To experience this, switch to the default cluster and query the view again:

    SET CLUSTER = default;
    
    SELECT * FROM avg_bids;
    

            item        |    average_bid
    --------------------+--------------------
     Custom Art         |  49.76620397600282
     Gift Basket        | 49.850028105677346
     City Bar Crawl     |  50.08233974737339
     Best Pizza in Town |  50.46824567514223
     Signed Memorabilia |  50.12977674688315
    (5 rows)
    
    Time: 846.322 ms
    

    Latency is high again because the index you created on the view exists only inside the compute_qck cluster. In the default cluster, where you are currently, you don’t have access to the index’s pre-computed results. Instead, the view once again retrieves data from durable storage and computes the results at query-time.

  8. In many cases, however, you’ll want results to be accessible from multiple clusters. To achieve this, you use materialized views.

    Like an index, a materialized view incrementally computes the results of a query as new data arrives. But unlike an index, a materialized view persists its results to durable storage that is accessible to all clusters.

    To see this in action, confirm that you are in the default cluster and then create a materialized view:

    SHOW CLUSTER;
    

     cluster
    ---------
     default
    (1 row)
    
    CREATE MATERIALIZED VIEW num_bids AS
      SELECT auctions.item, count(bids.id) AS number_of_bids
      FROM bids
      JOIN auctions ON bids.auction_id = auctions.id
      WHERE bids.bid_time < auctions.end_time
      GROUP BY auctions.item;
    

    The SELECT in this materialized view joins data from auctions and bids, but this time to get the number of eligible bids per item.

  9. Switch to the compute_qck cluster and query the materialized view:

    SET CLUSTER = compute_qck;
    
    SELECT * FROM num_bids;
    

            item        | number_of_bids
    --------------------+----------------
     Custom Art         |          10634
     Gift Basket        |          11266
     City Bar Crawl     |          10292
     Best Pizza in Town |          10498
     Signed Memorabilia |          10801
    (5 rows)
    
    Time: 790.384 ms
    

    As you can see, although the materialized view was created in the default cluster, its results are available from other clusters as well because they are in shared, durable storage.
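
    You can verify this from any cluster by listing the materialized views in the current schema:

    SHOW MATERIALIZED VIEWS;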

  10. If retrieving a materialized view’s results from storage is too slow, you can create an index on the materialized view as well:

    CREATE INDEX num_bids_idx ON num_bids (item);
    
    SELECT * FROM num_bids;
    

            item        | number_of_bids
    --------------------+----------------
     Custom Art         |          14373
     Gift Basket        |          15271
     City Bar Crawl     |          14294
     Best Pizza in Town |          14606
     Signed Memorabilia |          14843
    (5 rows)
    
    Time: 32.064 ms
    

    Now that the materialized view serves results from memory, latency is low again.

Step 5. Survive failures

Earlier, when you created your clusters, you gave each cluster one replica, that is, one physical resource. For the ingest_qck cluster, that’s the maximum allowed: clusters that contain sources can have only one replica. For the compute_qck cluster, however, you can increase the number of replicas for greater tolerance to replica failure.

Each replica in a non-source cluster is a logical clone, doing the same computation and holding the same results in memory. This design gives Materialize active replication: as long as at least one replica is reachable, the cluster continues making progress.

Let’s see this in action.

  1. Add a second replica to the compute_qck cluster:

    CREATE CLUSTER REPLICA compute_qck.r2 SIZE = '2xsmall';
    
  2. Check the status of the new replica:

    SHOW CLUSTER REPLICAS WHERE cluster = 'compute_qck';
    

       cluster   | replica |  size   | ready
    -------------+---------+---------+-------
     compute_qck | r1      | 2xsmall | t
     compute_qck | r2      | 2xsmall | t
    (2 rows)
    
  3. Once the r2 replica is ready (ready=t), drop the r1 replica to simulate a failure:

    DROP CLUSTER REPLICA compute_qck.r1;
    
  4. Query the indexed view that you created in compute_qck earlier:

    SELECT * FROM avg_bids;
    

            item        |    average_bid
    --------------------+--------------------
     Custom Art         | 49.770776201263864
     Gift Basket        |   49.8909070204407
     City Bar Crawl     | 50.056635368698046
     Best Pizza in Town |  50.50023551577956
     Signed Memorabilia |  50.11854192264935
    (5 rows)
    
    Time: 23.537 ms
    

    As you can see, the results are available despite the failure of one of the cluster’s replicas.

Step 6. Scale up or down

In addition to using replicas to increase fault tolerance, you can add and remove replicas to scale resources up or down according to the needs of a cluster. For example, let’s say the 2xsmall replica in the compute_qck cluster is running low on memory.

  1. Add a replica of the next size up, xsmall:

    CREATE CLUSTER REPLICA compute_qck.r3 SIZE = 'xsmall';
    
  2. Use the SHOW CLUSTER REPLICAS command to check the status of the new replica:

    SHOW CLUSTER REPLICAS WHERE cluster = 'compute_qck';
    

       cluster   | replica |  size   | ready
    -------------+---------+---------+-------
     compute_qck | r2      | 2xsmall | t
     compute_qck | r3      | xsmall  | t
    (2 rows)
    
  3. Once the r3 replica is ready (ready=t), it’s safe to remove the r2 replica:

    DROP CLUSTER REPLICA compute_qck.r2;
    
  4. Again, because replicas in a cluster are logical clones, the new replica returns results just like the old replica:

    SELECT * FROM avg_bids;
    

            item        |    average_bid
    --------------------+-------------------
     Custom Art         | 49.82171985815603
     Gift Basket        | 49.81225672519678
     City Bar Crawl     | 50.16030259365994
     Best Pizza in Town | 50.32564214192425
     Signed Memorabilia | 49.96557507282196
    (5 rows)
    
    Time: 31.655 ms
    

Step 7. Clean up

Once you’re done exploring the auction house source, remember to clean up your environment:

DROP SCHEMA qck CASCADE;
DROP CLUSTER ingest_qck;
DROP CLUSTER compute_qck;
RESET search_path;
RESET cluster;
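
To double-check the cleanup, list your remaining clusters; only the pre-installed ones should be left:

SHOW CLUSTERS;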
