GCP BigLake
View as MarkdownTo enable this feature in your Materialize region, contact our team.
Iceberg tables managed by the Lakehouse runtime catalog (BigLake API) do not receive automated maintenance like expiring old snapshots and compacting manifests and data files.
Without table maintenance, table metadata grows over time and will eventually exceed BigLake’s allowed limit. This will prevent Materialize Iceberg sinks from committing new data.
Iceberg tables managed by BigQuery do receive automated maintenance, but only BigQuery can write to them.
This guide walks you through the steps required to set up Iceberg sinks in Materialize Cloud.
Prerequisites
- Google Cloud project with the BigLake API enabled.
- Google Cloud Storage bucket to serve as the Iceberg warehouse.
- Lakehouse runtime catalog backed by your warehouse bucket. For Authentication method, you can select either End-user credentials or Credential vending mode. Materialize authenticates separately with a GCP service account key (provided in the next step), so both modes work.
- Namespace in the catalog.
Create the Iceberg catalog connection in Materialize
Step 1. Set up permissions in GCP
Materialize uses a Google Cloud service account to authenticate to BigLake.
-
Create the service account.
-
Grant the service account these roles on your project:
biglake.editor(BigLake Editor)serviceusage.serviceUsageConsumer(Service Usage Consumer)
-
Grant the service account this role on your Iceberg warehouse bucket:
storage.objectUser(Storage Object User)
-
Base64-encode the entire JSON key (e.g.
base64 < sa_key.json). In the next step, you will decode the resulting string in theCREATE SECRETstatement. Encoding the key first and decoding it in theCREATE SECRETstatement avoids escaping quotes and newlines in the SQL string literal.
Step 2. Create a GCP connection and Iceberg catalog connection in Materialize
The following example creates a GCP connection and an Iceberg catalog connection for Google Cloud BigLake:
-- Using the base64-encoded service account key (e.g. base64 < sa_key.json)
CREATE SECRET gcp_service_account_key
AS decode('<base64-encoded service account key JSON>', 'base64');
-- Create a GCP connection that uses the service-account key.
CREATE CONNECTION gcp_connection TO GCP (
SERVICE ACCOUNT KEY = SECRET gcp_service_account_key
);
-- Create the Iceberg catalog connection pointing to BigLake.
CREATE CONNECTION iceberg_catalog_connection TO ICEBERG CATALOG (
CATALOG TYPE = 'rest',
URL = 'https://biglake.googleapis.com/iceberg/v1/restcatalog',
WAREHOUSE = 'gs://<bucket>',
GCP CONNECTION = gcp_connection
);
Create the Iceberg sink in Materialize
In Materialize, you can sink from a materialized view, table, or source. Use
CREATE SINK to create an Iceberg sink, replacing:
<sink_name>with a name for your sink.<sink_cluster>with the name of your sink cluster.<my_materialize_object>with the name of your materialized view, table, or source.<my_iceberg_namespace>with your catalog namespace.<my_iceberg_table>with the name of your Iceberg table. If the Iceberg table does not exist, Materialize creates the table. For details, seeCREATE SINKreference page.<commit_interval>with your commit interval (e.g.,60s). The commit interval specifies how frequently Materialize commits snapshots to Iceberg. The minimum commit interval is1s. See Commit interval tradeoffs below.
For the full list of syntax options, see the CREATE SINK reference.
Upsert mode
In upsert mode, replace <key> with the column(s) that uniquely identify rows.
CREATE SINK <sink_name>
IN CLUSTER <sink_cluster>
FROM <my_materialize_object>
INTO ICEBERG CATALOG CONNECTION iceberg_catalog_connection (
NAMESPACE = '<my_iceberg_namespace>',
TABLE = '<my_iceberg_table>'
)
KEY (<key>)
MODE UPSERT
WITH (COMMIT INTERVAL = '<commit_interval>');
Append mode
In append mode, no KEY clause is used. The Iceberg table includes all source
columns plus _mz_diff (int) and _mz_timestamp (long).
CREATE SINK <sink_name>
IN CLUSTER <sink_cluster>
FROM <my_materialize_object>
INTO ICEBERG CATALOG CONNECTION iceberg_catalog_connection (
NAMESPACE = '<my_iceberg_namespace>',
TABLE = '<my_iceberg_table>'
)
MODE APPEND
WITH (COMMIT INTERVAL = '<commit_interval>');
Considerations
Commit interval tradeoffs
The COMMIT INTERVAL setting controls how frequently Materialize commits
snapshots to your Iceberg table, making the data available to downstream query
engines. This setting involves tradeoffs:
Shorter intervals (e.g., < 60s) |
Longer intervals (e.g., 5m) |
|---|---|
| Lower latency - data visible sooner in downstream systems | Higher latency - data takes longer to appear |
| More small files - can degrade query performance over time | Fewer, larger files - better query performance |
| More frequent snapshot commits - higher catalog overhead | Less catalog overhead |
| Lower throughput efficiency | Higher throughput efficiency |
Recommendations:
- For production, use intervals of
60sor longer - For batch analytics, use longer intervals (
5mto15m)
60s.
Short commit intervals increase catalog overhead and produce many small files.
Small files will result in degraded query performance. It also increases load on
the Iceberg metadata, which can result in a degraded catalog, and non-responsive
queries.
Exactly-once delivery
Iceberg sinks provide exactly-once delivery. After a restart, Materialize resumes from the last committed snapshot without duplicating data.
Materialize stores progress information in Iceberg snapshot metadata
properties (mz-frontier and mz-sink-version).
Type mapping
Materialize converts SQL types to Iceberg/Parquet types:
| SQL type | Iceberg type |
|---|---|
boolean |
boolean |
smallint, integer |
int |
uint2 |
int |
bigint |
long |
uint4 |
long |
uint8 |
decimal(20, 0) |
real |
float |
double precision |
double |
numeric |
decimal(38, scale) |
date |
date |
time |
time (microsecond) |
timestamp |
timestamp (microsecond) |
timestamptz |
timestamptz (microsecond) |
text, varchar |
string |
bytea |
binary |
uuid |
fixed(16) |
jsonb |
string |
interval |
string |
int4range, int8range, numrange, daterange, tsrange, tstzrange |
struct (fields: lower, upper, lower_inclusive, upper_inclusive, empty) |
record |
struct |
list |
list |
map |
map |
Limitations
-
Iceberg tables managed by the Lakehouse runtime catalog (BigLake API) do not receive automated maintenance like expiring old snapshots and compacting manifests and data files.
Without table maintenance, table metadata grows over time and will eventually exceed BigLake’s allowed limit. This will prevent Materialize Iceberg sinks from committing new data.
- Iceberg tables managed by BigQuery do receive automated maintenance, but only BigQuery can write to them.
- Partitioned tables are not supported.
-
Schema evolution of an Iceberg table is not supported. If the
SINK FROMobject’s schema changes, you must drop and recreate the sink.
Troubleshooting
Sink creation fails with “input compacted past resume upper”
This error occurs when the source data has been compacted beyond the point where the sink last committed. This can happen after a Materialize backup/restore operation. You may need to drop and recreate the sink, which will re-snapshot the entire source relation.
Commit conflicts
If another process modifies the Iceberg table while Materialize is committing, you may see commit conflict errors. Materialize will automatically retry, but if conflicts persist, ensure no other writers are modifying the same table.