
Business Intelligence Demo

tl;dr Materialize can enable real-time monitoring within business intelligence tools, and we have a demo showing you that it’s feasible.

Traditionally, business intelligence tools rely on reading day-old answers from data warehouses. Given the substantial size of the data they’re required to process, it makes sense that you wouldn’t want to recompute the answer every time someone wants to view it.

Materialize, though, offers a different paradigm for powering BI tools: complex, real-time analysis over massive datasets. The fundamental idea is that Materialize persists the results of your queries and then incrementally updates them as new data comes in. Unlike the traditional RDBMS model, which recomputes each answer from scratch rather than reusing previously completed work, Materialize minimizes the time required to provide fresh answers.
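
As a minimal sketch of that paradigm (the table and column names here are hypothetical, not part of the demo’s dataset), you define a materialized view once, and every subsequent read returns the incrementally maintained answer:

    -- Hypothetical example: Materialize computes this aggregate once,
    -- then keeps it up to date as new rows arrive from the source.
    CREATE MATERIALIZED VIEW revenue_by_region AS
        SELECT region, sum(amount) AS total_revenue
        FROM purchases
        GROUP BY region;

    -- Each read returns the current answer without recomputing it.
    SELECT * FROM revenue_by_region;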

Overview

In this demo, we’ll take the role of a business that has a transactional workload running against our servers. To support the business’ goals, we’ll stand up a BI tool to perform analytic-style queries over that transactional data.

BI tools often sit on top of complex infrastructure, which is unreasonable to spin up for a demo. To let you get a feel for what Materialize is like to use without an enormous investment of time, we’ve built a Docker Compose file that takes care of the entire deployment for you.

In this section, we’ll cover the “what” and “why” of our proposed deployment using Materialize to provide real-time analytics within Metabase, an open source business intelligence tool.

Dataset & load generator

For this demo, Materialize uses a custom benchmark called chBench, which combines two well-known database benchmarking tools: TPC-C and TPC-H.

TPC-C is an industry-standard benchmark for transactional workloads, meant to test a database’s scalability.

TPC-H is from the same group that developed TPC-C, and is meant to test the capabilities of online analytic platforms by performing complex, analytic-style queries. These include large aggregations, many groupings, and complex multi-way joins.

CH-benCHmark brings together a TPC-C-like dataset with TPC-H’s analytical queries. This is a good approximation of how many businesses perform OLAP queries over OLTP data.

Database (MySQL)

This demo relies on MySQL, which is a stable, well-supported platform with good performance.

Change Data Capture & Streaming (Debezium & Kafka)

As the chBench load generator writes data to your database, we need to propagate those changes to Materialize so it can update the answers to your queries.

With MySQL, the easiest way to do this is through a change data capture (CDC) tool, which can describe changes to your data. This demo relies on Debezium as its CDC tool.

To ferry the CDC data to Materialize, we stream it through Kafka. This requires a suite of tools from Confluent (e.g. ZooKeeper and the Confluent Schema Registry), which we’ve also included in our demo’s Docker deployment.
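
The demo’s workflow creates these sources for you, but conceptually each MySQL table arrives in Materialize as a Kafka source that understands Debezium’s change envelope. A rough sketch of what that looks like follows; the broker address, topic name, and schema registry URL here are illustrative, and the exact CREATE SOURCE syntax depends on your Materialize version.

    -- Sketch only: ingest Debezium-formatted change events for one table.
    -- Broker, topic, and schema registry addresses are placeholders.
    CREATE SOURCE orderline
    FROM KAFKA BROKER 'kafka:9092' TOPIC 'debezium.tpcch.orderline'
    FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081'
    ENVELOPE DEBEZIUM;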

Materialize

Materialize ingests the CDC data from MySQL, representing the chBench load generator’s continued activity.

With that data available, you can perform complex queries, and Materialize maintains their results, even as the underlying data changes. This means that you can get answers to your queries in real time, rather than relying on day-old answers from your data warehouse.

BI Tool (Metabase)

Metabase is an open source tool for creating visualizations of SQL queries’ results and grouping them into dashboards. For instance, teams might use Metabase to monitor geographic purchasing patterns across their stores.

In our demo, we’ll use Metabase to visualize the results of some TPC-H-like queries, and watch the visualizations update as quickly as Metabase allows.

If you think of this deployment in terms of client-server relationships, the BI tool is the client and Materialize is the server.

Diagram

Putting this all together, our deployment looks like this:

Materialize deployment diagram with Metabase

Run the demo

Preparing the environment

  1. Set up Docker and Docker Compose, if you haven’t already.

    Note for macOS users: Be sure to increase Docker resources to at least 2 CPUs and 8GB memory. Running Docker for Mac with fewer resources may cause the demo to fail.

  2. Clone the Materialize repository:

    git clone https://github.com/MaterializeInc/materialize.git
    

    You can also view the demo’s code on GitHub.

  3. Download and start all of the components we’ve listed above by running this from the root of the materialize repository you just cloned:

    bin/mzconduct run chbench --workflow demo-load
    

    Note that downloading the Docker images necessary for the demo can take quite a bit of time (upwards of 10 minutes, even on very fast connections).

Define sources & views

Now that our deployment is running (and looks like the diagram shown above), we can get Materialize to read data from Kafka, and define the views we want Materialize to maintain for us.

  1. Launch a new terminal window and cd <path to materialize>/demo/chbench.

  2. Launch the Materialize CLI (mzcli) by running:

    ./mzcompose run cli
    
  3. Within mzcli, ensure you have all of the necessary sources, which represent all of the tables from MySQL:

    SHOW SOURCES;
    
    customer
    district
    item
    nation
    neworder
    order
    orderline
    region
    stock
    supplier
    warehouse
    
  4. Create a straightforward view of the underlying data.

    CREATE MATERIALIZED VIEW query01 AS
        SELECT
            ol_number,
            sum(ol_quantity) as sum_qty,
            sum(ol_amount) as sum_amount,
            avg(ol_quantity) as avg_qty,
            avg(ol_amount) as avg_amount,
            count(*) as count_order
        FROM orderline
        WHERE ol_delivery_d > date '1998-12-01'
        GROUP BY ol_number;
    

    This view represents “Query 01” in chBench, which tracks statistics about the TPC-C orderline table. (A short sketch after this list shows how to inspect the views Materialize is maintaining.)

  5. Check the results of the view:

    SELECT * FROM query01;
    

    If you run this query multiple times, you should see the results change, and the answers should come back pretty quickly. (How quickly depends on the speed of your computer, but it should be some small fraction of a second.)

  6. Define another view for “Query 07”, which involves a complex multi-way join across six distinct tables (nation is joined twice, for seven relations in total):

    CREATE MATERIALIZED VIEW query07 AS
        SELECT
            su_nationkey AS supp_nation,
            substr(c_state, 1, 1) AS cust_nation,
            extract('year' FROM o_entry_d) AS l_year,
            sum(ol_amount) AS revenue
        FROM
            supplier,
            stock,
            orderline,
            "order",
            customer,
            nation AS n1,
            nation AS n2
        WHERE
            ol_supply_w_id = s_w_id
            AND ol_i_id = s_i_id
            AND s_su_suppkey = su_suppkey
            AND ol_w_id = o_w_id
            AND ol_d_id = o_d_id
            AND ol_o_id = o_id
            AND c_id = o_c_id
            AND c_w_id = o_w_id
            AND c_d_id = o_d_id
            AND su_nationkey = n1.n_nationkey
            AND c_n_nationkey = n2.n_nationkey
        GROUP BY
            su_nationkey,
            substr(c_state, 1, 1),
            extract('year' FROM o_entry_d)
        ORDER BY su_nationkey, cust_nation, l_year;
    
  7. Check the results of this query:

    SELECT * FROM query07;
    

    This query can take a few minutes to begin producing answers. Before answers are produced, you will see the following error:

    SELECT * FROM query07;
    
    At least one input has no complete timestamps yet: { .. }
    

    If you receive this error or an empty set of results, either wait a moment and retry, or disconnect from mzcli and relaunch it. Once results appear, they should update across reads just like the prior materialized view, and the responses should come back quickly!
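
As mentioned in step 4, here is a short sketch of how to inspect what Materialize is maintaining from within mzcli (the exact SHOW commands available can vary by Materialize version):

    -- List the views Materialize knows about.
    SHOW VIEWS;

    -- Inspect the SQL that defines a specific view.
    SHOW CREATE VIEW query01;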

Set up Metabase

  1. In a browser, go to localhost:3030.

  2. Click Let’s get started.

  3. Complete the first set of fields, which ask for your email address and other account details. This information isn’t crucial for the demo, but it does have to be filled in.

  4. On the Add your data page, fill in the following information:

    Field               Enter…
    Database            Materialize
    Name                tpcch
    Host                materialized
    Port                6875
    Database name       materialize
    Database username   root
    Database password   Leave empty.
  5. Proceed past the screens until you reach your primary dashboard.

Create dashboards

  1. Click Ask a question.

  2. Click Native query.

  3. From Select a database, select tpcch.

  4. In the query editor, enter:

    SELECT * FROM query01;
    

    Materialize relies on you already having created a materialized view with this name, which you did in the Materialize CLI a few steps back.

    In a production setting, you might want to let users find views’ underlying queries. For example, you might store the underlying queries in a GitHub repository, and provide a link to the repository as a comment. Those with direct access to the materialized process through a SQL client can find the underlying queries using SHOW CREATE VIEW.

  5. Click the large > button.

    If you receive an error stating One or more sources has no closed timestamps, wait a few seconds, press enter in the query editor, and then repeat this process.

  6. Once you see results, click Save, enter query01 as the query’s name, and then click Save again.

  7. When asked if you want to save this to a dashboard, click Yes please!.

  8. Click My personal collection, and then click Create a new dashboard.

  9. Enter tpcch as the Name, and then click Create.

  10. Modify the size of the query01 card, and then click Save.

  11. Click Auto-refresh, and then select 1 minute (which is the fastest refresh rate that the tool offers; Materialize can reasonably support 1 second refreshes for many query types).

At this point, Metabase will automatically refresh this analysis for you every minute.

If you want to see more chBench queries, you can repeat these steps for the view query07 or any of the queries listed in our chBench query index.

Recap

In this demo, we saw:

  - How to stream CDC data from MySQL into Materialize using Debezium and Kafka
  - How to define materialized views that Materialize keeps up to date as the underlying data changes
  - How to connect a BI tool (Metabase) to Materialize and build dashboards that refresh in near real time

Details

Typical operation

In this demo, you “played” the role of both infrastructure engineer and business analyst; that is to say, you both deployed Materialize and ran queries within your BI tool, Metabase.

To give you a sense of what this experience would look like if you were two different people, here’s the expected workflow:

  - An infrastructure engineer deploys Materialize, connects it to the upstream sources (here, the Kafka topics fed by Debezium), and creates the materialized views that analysts need.
  - A business analyst connects a BI tool like Metabase to Materialize and builds visualizations and dashboards on top of those views.

Of course, it’s possible to change this workflow to make it more self-service, but this is a reasonable starting point.