GCP BigLake

View as Markdown
PREVIEW This feature is in private preview. It is under active development and may have stability or performance issues. It isn't subject to our backwards compatibility guarantees.
To enable this feature in your Materialize region, contact our team.
WARNING!

Iceberg tables managed by the Lakehouse runtime catalog (BigLake API) do not receive automated maintenance like expiring old snapshots and compacting manifests and data files.

Without table maintenance, table metadata grows over time and will eventually exceed BigLake’s allowed limit. This will prevent Materialize Iceberg sinks from committing new data.

Iceberg tables managed by BigQuery do receive automated maintenance, but only BigQuery can write to them.

This guide walks you through the steps required to set up Iceberg sinks in Materialize Cloud.

Prerequisites

Create the Iceberg catalog connection in Materialize

Step 1. Set up permissions in GCP

Materialize uses a Google Cloud service account to authenticate to BigLake.

  1. Create the service account.

  2. Grant the service account these roles on your project:

    • biglake.editor (BigLake Editor)
    • serviceusage.serviceUsageConsumer (Service Usage Consumer)
  3. Grant the service account this role on your Iceberg warehouse bucket:

    • storage.objectUser (Storage Object User)
  4. Create a service account key in JSON format.

  5. Base64-encode the entire JSON key (e.g. base64 < sa_key.json). In the next step, you will decode the resulting string in the CREATE SECRET statement. Encoding the key first and decoding it in the CREATE SECRET statement avoids escaping quotes and newlines in the SQL string literal.

Step 2. Create a GCP connection and Iceberg catalog connection in Materialize

The following example creates a GCP connection and an Iceberg catalog connection for Google Cloud BigLake:

-- Using the base64-encoded service account key (e.g. base64 < sa_key.json)
CREATE SECRET gcp_service_account_key
  AS decode('<base64-encoded service account key JSON>', 'base64');

-- Create a GCP connection that uses the service-account key.
CREATE CONNECTION gcp_connection TO GCP (
    SERVICE ACCOUNT KEY = SECRET gcp_service_account_key
);

-- Create the Iceberg catalog connection pointing to BigLake.
CREATE CONNECTION iceberg_catalog_connection TO ICEBERG CATALOG (
    CATALOG TYPE = 'rest',
    URL = 'https://biglake.googleapis.com/iceberg/v1/restcatalog',
    WAREHOUSE = 'gs://<bucket>',
    GCP CONNECTION = gcp_connection
);

Create the Iceberg sink in Materialize

In Materialize, you can sink from a materialized view, table, or source. Use CREATE SINK to create an Iceberg sink, replacing:

  • <sink_name> with a name for your sink.
  • <sink_cluster> with the name of your sink cluster.
  • <my_materialize_object> with the name of your materialized view, table, or source.
  • <my_iceberg_namespace> with your catalog namespace.
  • <my_iceberg_table> with the name of your Iceberg table. If the Iceberg table does not exist, Materialize creates the table. For details, see CREATE SINK reference page.
  • <commit_interval> with your commit interval (e.g., 60s). The commit interval specifies how frequently Materialize commits snapshots to Iceberg. The minimum commit interval is 1s. See Commit interval tradeoffs below.

For the full list of syntax options, see the CREATE SINK reference.

Upsert mode

In upsert mode, replace <key> with the column(s) that uniquely identify rows.

CREATE SINK <sink_name>
  IN CLUSTER <sink_cluster>
  FROM <my_materialize_object>
  INTO ICEBERG CATALOG CONNECTION iceberg_catalog_connection (
    NAMESPACE = '<my_iceberg_namespace>',
    TABLE = '<my_iceberg_table>'
  )
  KEY (<key>)
  MODE UPSERT
  WITH (COMMIT INTERVAL = '<commit_interval>');

Append mode

In append mode, no KEY clause is used. The Iceberg table includes all source columns plus _mz_diff (int) and _mz_timestamp (long).

CREATE SINK <sink_name>
  IN CLUSTER <sink_cluster>
  FROM <my_materialize_object>
  INTO ICEBERG CATALOG CONNECTION iceberg_catalog_connection (
    NAMESPACE = '<my_iceberg_namespace>',
    TABLE = '<my_iceberg_table>'
  )
  MODE APPEND
  WITH (COMMIT INTERVAL = '<commit_interval>');

Considerations

Commit interval tradeoffs

The COMMIT INTERVAL setting controls how frequently Materialize commits snapshots to your Iceberg table, making the data available to downstream query engines. This setting involves tradeoffs:

Shorter intervals (e.g., < 60s) Longer intervals (e.g., 5m)
Lower latency - data visible sooner in downstream systems Higher latency - data takes longer to appear
More small files - can degrade query performance over time Fewer, larger files - better query performance
More frequent snapshot commits - higher catalog overhead Less catalog overhead
Lower throughput efficiency Higher throughput efficiency

Recommendations:

  • For production, use intervals of 60s or longer
  • For batch analytics, use longer intervals (5m to 15m)
NOTE: Outside of development environments, commit intervals should be at least 60s. Short commit intervals increase catalog overhead and produce many small files. Small files will result in degraded query performance. It also increases load on the Iceberg metadata, which can result in a degraded catalog, and non-responsive queries.

Exactly-once delivery

Iceberg sinks provide exactly-once delivery. After a restart, Materialize resumes from the last committed snapshot without duplicating data.

Materialize stores progress information in Iceberg snapshot metadata properties (mz-frontier and mz-sink-version).

Type mapping

Materialize converts SQL types to Iceberg/Parquet types:

SQL type Iceberg type
boolean boolean
smallint, integer int
uint2 int
bigint long
uint4 long
uint8 decimal(20, 0)
real float
double precision double
numeric decimal(38, scale)
date date
time time (microsecond)
timestamp timestamp (microsecond)
timestamptz timestamptz (microsecond)
text, varchar string
bytea binary
uuid fixed(16)
jsonb string
interval string
int4range, int8range, numrange, daterange, tsrange, tstzrange struct (fields: lower, upper, lower_inclusive, upper_inclusive, empty)
record struct
list list
map map

Limitations

  • Iceberg tables managed by the Lakehouse runtime catalog (BigLake API) do not receive automated maintenance like expiring old snapshots and compacting manifests and data files.

    Without table maintenance, table metadata grows over time and will eventually exceed BigLake’s allowed limit. This will prevent Materialize Iceberg sinks from committing new data.

  • Iceberg tables managed by BigQuery do receive automated maintenance, but only BigQuery can write to them.
  • Partitioned tables are not supported.
  • Schema evolution of an Iceberg table is not supported. If the SINK FROM object’s schema changes, you must drop and recreate the sink.

Troubleshooting

Sink creation fails with “input compacted past resume upper”

This error occurs when the source data has been compacted beyond the point where the sink last committed. This can happen after a Materialize backup/restore operation. You may need to drop and recreate the sink, which will re-snapshot the entire source relation.

Commit conflicts

If another process modifies the Iceberg table while Materialize is committing, you may see commit conflict errors. Materialize will automatically retry, but if conflicts persist, ensure no other writers are modifying the same table.

Back to top ↑