The Cost of Querying
Each time you query a database you incur some cost. Your database will parse, validate, plan, optimize, and execute your query, using up wall clock time, CPU time, memory, opportunity cost, and, potentially, actual dollars. As you grow your application, you will likely wonder how you can reduce these costs.
To dig into this, let’s look at an example. Imagine you work at a company that sells widgets. Each time a widget is sold, you record some purchase information in a
purchase table that looks like this:
With this data, you can answer questions that are critical to your business: what’s our best selling widget? Which widgets need to be shipped today? And, the evergreen question: how much money have we made?
Each of these questions can be answered by querying the
purchase table directly. For example, you could calculate your total revenue with the following query:
When this query is run, your database will scan each row of the
purchase table to aggregate the values in the
amount column. Your database will have spent
X resources to return your result. (As mentioned before,
X is comprised of wall clock time, CPU time, memory, opportunity cost, and dollars.)
The good news is you’re happy to pay
X because this information is important and valuable to your business. The bad news is that you will pay
X each time your query is run. Because queries against tables and views do not reference previous results, they are forced to completely recalculate their results each time.
You might be wondering: is this a big deal? If your query is exploratory, ad hoc, or infrequent, it’s probably not. But what if you suddenly needed to calculate total revenue once an hour instead of once a day? Or once a second?
Suddenly, even a small
X starts to feel expensive. As your queries are run more frequently, process more data, or perform more complex computation, these costs may even feel prohibitive. So, what can you do instead?
Unlike queries that calculate their results from tables and views, queries that read from materialized views do not recalculate their results each time. To understand why, let’s create a materialized view together.
When this statement is executed, your database will run your query once to calculate the result. Then, it will physically store that result in a newly created database object -- in our case an object named
total_revenue. Now when you query
total_revenue, your database will return the results stored in
total_revenue without performing any additional computation.
This means that unlike querying a view directly, you only pay the cost of your query once when creating a materialized view.* Future queries will pay some constant cost
1 to read from
total_revenue, instead of
This seems like a perfect solution for our problem, so what’s the catch? Why don’t we use materialized views for everything?
The first catch is that you pay for materialized views with memory. While materialized views physically storing prior results reduces your query cost from
1, it also requires memory that was not required to simply query a table or a logical view.
The second, trickier catch surfaces when a new purchase is added to the
purchase table. To illustrate, let’s walk through this scenario together.
Noticing that we have a new sale in our
purchase table, we query our materialized view for our updated total revenue. But, we get back our old value. What gives?
When we created our materialized view, our database ran our query once to store the results in
total_revenue. However, since the query was run before the newest purchase was added to the
purchase table, the result calculated and stored in
total_revenue does not take this new purchase into account. To update the result to include new purchases, you will have to refresh the materialized view. Generally, there are two mechanisms for refreshing materialized views: complete refreshes or incremental refreshes.
Complete refreshes rerun the query underlying a materialized view to completely recalculate and replace the stored results. Just like queries against tables and views, every complete refresh will cost
X. To complicate matters, you will have to decide when to perform a refresh: when someone queries the view? Once an hour? Once a week? You will want to choose a cadence that does not force the database to do unnecessary work. For example, you don't want to pay
total_revenue if there are no new purchases. In real world systems, users will have to choose a refresh schedule that maintains a careful balance between data freshness requirements and acceptable query costs.
By contrast, incremental refreshes keep materialized views up to date by only performing work only on data that has changed. To return to our example, instead of rescanning each row of the
purchase table to calculate the sum, an incrementally maintained materialized view would only do the following work:
This means that instead of paying
X to refresh the view, your database will only pay a fraction of
X proportional to the size of the changed data. And, because this refresh mechanism is triggered by changes in the underlying data, you can be certain that your database is refreshing your view no more than necessary. In our example,
total_revenue will only be updated when a row in the
purchase table is inserted, updated, or deleted.
Creating and maintaining a materialized view can reduce the query costs paid for expensive or frequently run queries. However, the potential cost reduction of a materialized view greatly depends on its underlying refresh mechanism. Only incremental refreshes can reduce the per-refresh cost of maintaining a materialized view while also ensuring that views are updated no more than necessary.
While incremental refreshes are available in a handful of databases, they often come with a long list of restrictions. At the moment, Materialize is the only system that supports incremental maintained materialized views without restriction. If you identified with the pain points in this blog post, check us out!