No Classification without Representation

Michael Greenberg
March 25, 2026

It’s well known that type systems are an avenue to better performance in conventional programming languages. Recently, an overhaul of how Materialize’s optimizer treats types led to better performance in our dataflows compiled from SQL. Using a simpler type system in the optimizer lets us generate simpler code and share more arrangements, leading to some serious reductions in memory. (One customer saw a 25% reduction in memory usage!)

A string by any other name

Materialize presents itself using Postgres’s type system. Much of this type system is standard SQL stuff: exact numeric types (SMALLINT, BIGINT, etc.); datetime types (DATE, TIMESTAMP, etc.); string types (CHARACTER, CHARACTER VARYING, etc.); arrays (ARRAY). Postgres extends these types further, with type aliases (VARCHAR is a touch shorter than CHARACTER VARYING) and novel internal types (TEXT is shorter still!). The types specified in table and view schemas determine which operations we select: both SMALLINT and BIGINT support addition (+), but a SMALLINT exhibits different overflow behavior from a BIGINT; CHARACTER(n) and BPCHAR and VARCHAR(n) all treat trailing whitespace differently.

The Postgres documentation on strings calls out the differences between the types, stressing the fact that they are represented the same way in memory (modulo padding). Materialize makes the same choice: we represent TEXT and VARCHAR and VARCHAR(n) and CHAR(n) and BPCHAR using the same structure at runtime, which we call Datum::String. Like Postgres, we have to carefully generate the correct operations for the given types---string concatenation (||) behaves differently depending on the string type being used.
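To make the idea concrete, here is a minimal sketch (not Materialize's actual code; `Datum`, `StringType`, and `concat` are illustrative names) of one runtime representation shared by every string type, where the *static* type decides which operation gets generated: blank-padded `CHAR(n)` operands have their trailing whitespace stripped before concatenation, while `TEXT` operands are concatenated as-is.

```rust
// Hypothetical sketch: one runtime representation for all SQL string
// types; the static type picks the operation's behavior.

#[derive(Debug, Clone, PartialEq)]
enum Datum {
    // TEXT, VARCHAR, VARCHAR(n), CHAR(n), and BPCHAR all live here.
    String(String),
}

#[derive(Debug, Clone, Copy)]
enum StringType {
    Text, // trailing whitespace is significant
    Char, // blank-padded; trailing whitespace is insignificant
}

// Concatenation (||) chooses its behavior from the static type,
// not from anything stored in the value itself.
fn concat(ty: StringType, a: &Datum, b: &Datum) -> Datum {
    let (Datum::String(a), Datum::String(b)) = (a, b);
    match ty {
        StringType::Text => Datum::String(format!("{a}{b}")),
        StringType::Char => {
            // CHAR(n) padding is stripped before concatenating.
            Datum::String(format!("{}{}", a.trim_end(), b.trim_end()))
        }
    }
}
```

With `StringType::Char`, concatenating `"abc  "` and `"def"` yields `"abcdef"`; with `StringType::Text`, the padding survives. Same bytes at runtime, different generated operation.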

Casts: correct bookkeeping comes at a cost

Our compiler meticulously ensures that the type of a SQL term matches the type of the dataflow we generate and run. We run a typechecker at multiple points in the optimization process to catch type errors well in advance of running bad dataflows. In order to keep our string types straight, we need to insert casts when we move between them. Sometimes a cast does real work---going from unbounded TEXT to CHAR(8) may mean truncating the string---and sometimes it doesn’t---going from unbounded VARCHAR to unbounded TEXT is just bookkeeping.

Unfortunately, the bookkeeping casts aren’t free:

  1. we might do no-op work on each row of a large stream when ‘transforming’ data from VARCHAR to TEXT; and
  2. we might not share arrangements if casts get in the way: if we have an arrangement on t.x but need one on varchar_to_text(t.x), we will make a new arrangement (even though varchar_to_text is a no-op).
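The second cost is worth spelling out: arrangement reuse compares key *expressions*, and a bookkeeping cast makes two expressions syntactically different even when they produce identical rows. A hypothetical sketch (the `Expr` and `can_reuse` names are illustrative, not Materialize's internals):

```rust
// Hypothetical sketch: why a bookkeeping cast blocks arrangement sharing.

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Expr {
    Column(usize),
    // A runtime no-op, but still a distinct expression node.
    CastVarcharToText(Box<Expr>),
}

// Reuse an existing arrangement only when the key expressions match.
fn can_reuse(existing_key: &Expr, wanted_key: &Expr) -> bool {
    existing_key == wanted_key
}
```

An arrangement keyed on `Column(0)` (i.e., `t.x`) is not reused for a key of `CastVarcharToText(Column(0))` (i.e., `varchar_to_text(t.x)`): `can_reuse` returns `false`, so we build and maintain a second arrangement over the very same data.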

We don’t want to give up on precise type management, but we also don’t want the overhead of pointless bookkeeping. How can we keep the safety of types while widening this performance bottleneck?

Representation types: distinctions mean differences

We made a new type system for our optimizer, called representation types. These types are noticeably simpler than the Postgres-style types, following the principle that types classify values, and so different types should exist only if they classify different values. Once we’ve generated the correct operations, we can trust our optimizer not to mess with them. We can therefore erase type distinctions that no longer matter---we can just use the representation type of strings, r_string, anywhere we would have used a Postgres/SQL-style string type.

Once the optimizer only cares about representation types, casts like varchar_to_text aren’t doing anything for anyone. We don’t need the type bookkeeping (the optimizer sees varchar_to_text as a function from r_string to r_string), and we don’t need the work (varchar_to_text is a no-op). So: out goes varchar_to_text! Anywhere we would have generated varchar_to_text(e) for some expression e, we can simply generate e on its own.
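A minimal sketch of both halves of the idea, under illustrative names (`ReprType`, `SqlType`, `repr_of`, and `elide_noop_casts` are not Materialize's actual internals): SQL-level string types all map to one representation type, and any cast that does no runtime work and preserves the representation type can be replaced by its input.

```rust
// Hypothetical sketch: erase SQL-level distinctions down to
// representation types, then drop casts that become identities.

#[derive(Debug, Clone, Copy, PartialEq)]
enum ReprType { RString, RInt64 }

#[derive(Debug, Clone, Copy, PartialEq)]
enum SqlType { Text, Varchar, Char, Bigint }

// Many SQL types, few representation types.
fn repr_of(ty: SqlType) -> ReprType {
    match ty {
        SqlType::Text | SqlType::Varchar | SqlType::Char => ReprType::RString,
        SqlType::Bigint => ReprType::RInt64,
    }
}

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(usize),
    // `does_work` is true when the cast truncates, pads, etc.
    Cast { from: SqlType, to: SqlType, does_work: bool, input: Box<Expr> },
}

// Replace cast(e) with plain e when the cast is pure bookkeeping:
// no runtime effect and the same representation type on both sides.
fn elide_noop_casts(e: Expr) -> Expr {
    match e {
        Expr::Cast { from, to, does_work, input } => {
            let input = elide_noop_casts(*input);
            if !does_work && repr_of(from) == repr_of(to) {
                input
            } else {
                Expr::Cast { from, to, does_work, input: Box::new(input) }
            }
        }
        other => other,
    }
}
```

A `VARCHAR`-to-`TEXT` cast of column 0 simplifies to just column 0, while a cast that does real work (say, truncating `TEXT` to `CHAR(8)`) survives untouched.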

Simply eliding a no-op cast may not seem like much, but these casts were all over the place---and getting in the way of the optimizer doing other good work. If there’s no varchar_to_text:

  • we can share more arrangements;
  • common-subexpression elimination can combine more terms, reducing redundant work; and
  • we can skip operators that would have only done no-op work (which opens up yet more opportunities for sharing).
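The first two bullets can be sketched together in a few lines (again hypothetical names, not Materialize's internals): once the no-op wrapper is stripped, syntactically distinct terms collapse into one, so common-subexpression elimination and arrangement sharing both see fewer distinct keys.

```rust
use std::collections::HashSet;

// Hypothetical sketch: eliding no-op casts makes more terms identical,
// so CSE and arrangement sharing have more to work with.

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Expr {
    Column(usize),
    VarcharToText(Box<Expr>), // a runtime no-op
}

// Strip every VarcharToText wrapper.
fn elide(e: &Expr) -> Expr {
    match e {
        Expr::Column(c) => Expr::Column(*c),
        Expr::VarcharToText(inner) => elide(inner),
    }
}

// How many distinct terms remain after elision?
fn distinct_terms(exprs: &[Expr]) -> usize {
    exprs.iter().map(elide).collect::<HashSet<_>>().len()
}
```

Before elision, `t.x` and `varchar_to_text(t.x)` are two distinct keys demanding two arrangements; after elision they are one term, shared everywhere it appears.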

Less busywork, more sharing

Finding the right type system frees up our optimizer to make more worthwhile changes. Much of the academic work on type systems focuses on making finer and finer distinctions; surprisingly, the right type system here is the coarser one, where we make fewer distinctions.

There are good performance improvements for workloads with heavy string conversion---which is not at all uncommon if you’re using a Postgres source. And there’s room for more improvements---we can change internal representations in ways that wouldn’t be possible under Postgres’s type system! There are casts other than varchar_to_text we can eliminate; there are other types that are synonyms; and there are more compact, efficient, and unified representations of various numeric types. There’s lots to look forward to!

Michael

Michael Greenberg

Technical Consultant, Materialize

Michael is an assistant professor of computer science at Stevens Institute of Technology, doing research in systems and programming languages. He worked previously at Pomona College, after a postdoc at Princeton and a PhD from the University of Pennsylvania.