Materialize Documentation

CREATE SOURCE: JSON from an S3 bucket

CREATE SOURCE connects Materialize to an external data source and lets you interact with its data as if the data were in a SQL table.

This document details how to connect Materialize to an S3 Bucket that contains multiple objects, and to listen for new object creation. Each S3 object can contain multiple records serialized as JSON, separated by newlines.

Conceptual framework

Sources represent connections to resources outside Materialize that it can read data from. For more information, see API Components: Sources.




Syntax

  CREATE [MATERIALIZED] SOURCE [IF NOT EXISTS] src_name [(col_name [, ...])]
  FROM S3 DISCOVER OBJECTS [MATCHING pattern] USING
      BUCKET SCAN 'bucket_name' | SQS NOTIFICATIONS 'queue_name' [, ...]
  [COMPRESSION algorithm]
  WITH ( field = val [, ...] )
  FORMAT TEXT | BYTES
  [ENVELOPE NONE]
Field Use
MATERIALIZED Materializes the source’s data, which retains all data in memory and makes sources directly selectable. For more information, see API Components: Materialized sources.
src_name The name for the source, which is used as its table name within SQL.
col_name Override default column name with the provided identifier. If used, a col_name must be provided for each column in the created source.
DISCOVER OBJECTS Describes how to discover keys to download. See Object discovery strategies.
BUCKET SCAN bucket_name Materialize will scan the bucket to find the set of objects to download.
SQS NOTIFICATIONS queue_name Materialize will subscribe to the specified queue and download new objects. See Listening to SQS notifications.
MATCHING pattern A glob-style pattern to filter objects to ingest. See Patterns. Default is to ingest all objects.
COMPRESSION algorithm The compression algorithm used to decode downloaded objects. See Compression.
WITH ( option_list ) Options affecting source creation. For more detail, see WITH options.
FORMAT TEXT Format the source’s data as ASCII-encoded text.
FORMAT BYTES Leave data received from the source as unformatted bytes stored in a column named data.
ENVELOPE NONE (Default) Use an append-only envelope. This means that records will only be appended and cannot be updated or deleted.

WITH options

The following options are valid within the WITH clause.

Field Value type Description
region text Required. A valid AWS region.

AWS Credentials WITH options

Field Value Description
access_key_id text A valid access key ID for the AWS resource.
secret_access_key text A valid secret access key for the AWS resource.
token text The session token associated with the credentials, if the credentials are temporary.

If you do not provide credentials via the WITH options above, then materialized will examine the standard AWS authorization chain:

  1. Environment variables: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
  2. credential_process command in the AWS config file, usually located at ~/.aws/config.
  3. AWS credentials file. Usually located at ~/.aws/credentials.
  4. IAM instance profile. Will only work if running on an EC2 instance with an instance profile/role.
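For reference, steps 2 and 3 of the chain read the standard AWS configuration files. A minimal ~/.aws/credentials sketch looks like this (the values are placeholders):

```ini
[default]
aws_access_key_id = AKIAEXAMPLEKEYID
aws_secret_access_key = exampleSecretAccessKey
```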

Credentials fetched from a container or instance profile expire on a fixed schedule. Materialize will attempt to refresh the credentials automatically before they expire, but the source will become inoperable if the refresh operation fails.

Permissions Required

The IAM User or Role used by materialized requires permission to perform different AWS actions depending on which actions are required to discover the list of objects to download.

The DISCOVER OBJECTS USING clause describes how Materialize will load objects, and so its parameters determine which permissions materialized requires. For example, since the SCAN key name source (as in DISCOVER OBJECTS USING BUCKET SCAN) must perform repeated ListObjects actions to create a list of key names to download, you must grant the Materialize IAM User or Role the ListObjects permission before you specify DISCOVER OBJECTS USING BUCKET SCAN.

Key name source Permissions required
All GetObject permission for the objects to be downloaded
BUCKET SCAN ListObject permission for the buckets to be scanned, unless the MATCHING pattern can only match a single object. In such cases, Materialize will perform only the necessary GetObject API call.
SQS NOTIFICATIONS ChangeMessageVisibility, DeleteMessage, GetQueueUrl, ReceiveMessage SQS Permissions for the queue Materialize will listen to
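As an illustration, an IAM policy granting the table’s permissions for a source that combines BUCKET SCAN with SQS NOTIFICATIONS might look like the following sketch. The bucket and queue names are placeholders; note that s3:ListBucket is the IAM permission behind the ListObjects API call:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::my-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-bucket"
    },
    {
      "Effect": "Allow",
      "Action": [
        "sqs:ChangeMessageVisibility",
        "sqs:DeleteMessage",
        "sqs:GetQueueUrl",
        "sqs:ReceiveMessage"
      ],
      "Resource": "arn:aws:sqs:*:*:my-queue"
    }
  ]
}
```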

For the authoritative list of S3 actions and permissions, see the AWS service documentation for Amazon S3.


S3 source details

The S3 source is designed to ingest a large volume of static data from Amazon Simple Storage Service (S3), AWS’s cloud object store. An object store behaves differently from a file system, and those differences affect how Materialize interacts with the objects in it.

Some key differences from file systems include:

  - Latency on individual operations is much higher than on a local file system.
  - Objects are immutable: an object can be created or replaced, but its contents cannot be appended to or modified in place.
  - Keys form a flat namespace; path-like key names are a naming convention, not a directory tree.

The primary effect of this is that we do not handle updates to S3 objects, and we may interleave the ingestion of multiple objects to speed it up.

S3 limitations

Object discovery strategies

Materialize has several techniques to discover the objects to download, with more planned for the future.

All strategies follow the same basic pattern. Materialize will:

  1. Obtain the list of objects to ingest, using the strategies specified in the CREATE SOURCE statement.
  2. Deduplicate that list, so that no object is downloaded twice.
  3. Download each object and parse its contents into records according to the declared format.

No guarantees are made about the relative order of records sent from the object discovery process and the Materialize SQL engine. As usual with SQL, you must impose your own order.

You may specify strategies multiple times within one CREATE SOURCE statement. For example, this is a legal invocation:

  DISCOVER OBJECTS USING
    BUCKET SCAN 'example',
    BUCKET SCAN 'other',
    SQS NOTIFICATIONS 'example-notifications'

Listing Bucket objects

The BUCKET SCAN strategy performs a single scan over the specified bucket at source creation time. If you would like an S3 source to ingest objects that are added to the bucket after the source is created, you must also configure an SQS NOTIFICATIONS discovery mechanism on the source.

Listening to SQS notifications

AWS S3 has a built-in method for notifying downstream applications of bucket modification, the Event Notification API. For Materialize, the only interesting modifications are object creations, i.e. the s3:ObjectCreated:* event namespace. Follow the AWS tutorial to configure a bucket for exactly this namespace.

Once you have configured S3 notifications to go to an SQS queue, you can point Materialize at that queue with DISCOVER OBJECTS USING SQS NOTIFICATIONS 'queue-name'.
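For example, a source that ingests only newly created objects could be declared as follows (the queue name is a placeholder):

```sql
CREATE MATERIALIZED SOURCE json_notifications
FROM S3 DISCOVER OBJECTS MATCHING '**/*.json' USING
  SQS NOTIFICATIONS 'my-bucket-notifications'
WITH (region = 'us-east-2')
FORMAT TEXT;
```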

Materialize deletes SQS messages as soon as they are ingested. This means that the same SQS queue cannot be used for multiple sources. If you would like to have multiple sources listening to notifications from the same bucket you must configure an SNS topic as an intermediary, with multiple SQS queues subscribed to it, one SQS queue per source. Note that SQS queues subscribed to SNS topics intended for Materialize must be configured to use raw message delivery.
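The SNS fan-out can be sketched with AWS CLI commands like the following (all ARNs are placeholders); the second command enables the raw message delivery that Materialize requires:

```shell
# Subscribe one SQS queue per Materialize source to the SNS topic
# that receives the bucket's s3:ObjectCreated:* notifications.
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-2:123456789012:bucket-events \
  --protocol sqs \
  --notification-endpoint arn:aws:sqs:us-east-2:123456789012:source-one-queue

# Enable raw message delivery on the subscription, so the queue
# receives the S3 event JSON without an SNS envelope.
aws sns set-subscription-attributes \
  --subscription-arn <subscription-arn-returned-above> \
  --attribute-name RawMessageDelivery \
  --attribute-value true
```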

Since Materialize treats unmaterialized sources with multiple downstream views as separate sources, SQS NOTIFICATIONS should not be used with unmaterialized sources. This behavior of unmaterialized sources is considered a bug (#7423), and will be improved in a future release.


Patterns

It is possible to filter the list of object keys to download using unix-style glob syntax as an argument to the MATCHING clause:

Pattern examples
Pattern          | Example matches   | Example excludes
**               | a , a/b/c.json    | none
2020/**/*.json   | 2020/11/uuid.json | data/2020/uuid.json , 2020/11/uuid.csv
*                | a                 | a/b
202{0,1}/*/*.csv | 2020/11/data.csv  | 2022/11/data.csv , 2020/11/01/data.csv
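Materialize evaluates these patterns internally; purely as an illustration of the semantics the table implies (`*` stays within a path segment, `**` may cross `/`, `{a,b}` is alternation), here is a small Python sketch. The function names are ours, not part of Materialize:

```python
import re

def glob_to_regex(pattern: str) -> str:
    """Translate the glob dialect from the table above into a regex.

    Assumed semantics: '*' matches within one path segment,
    '**' may cross '/' boundaries, and '{a,b}' is alternation.
    """
    out = []
    i = 0
    while i < len(pattern):
        if pattern[i : i + 2] == "**":
            out.append(".*")          # '**' crosses '/' boundaries
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")       # '*' stays within one segment
            i += 1
        elif pattern[i] == "{":
            close = pattern.index("}", i)
            alts = pattern[i + 1 : close].split(",")
            out.append("(" + "|".join(re.escape(a) for a in alts) + ")")
            i = close + 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return "".join(out)

def matches(pattern: str, key: str) -> bool:
    # fullmatch anchors the regex at both ends of the key.
    return re.fullmatch(glob_to_regex(pattern), key) is not None
```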


Text format details

Text-formatted sources read their data line by line: each newline-separated line of an object becomes one record, stored in a single column named text.

Raw byte format details

Raw byte-formatted sources provide Materialize the raw bytes received from the source without applying any formatting or decoding.

Raw byte-formatted sources have one column, which, by default, is named data.

Append-only envelope

Append-only envelope means that every record received by the source is treated as an insert. This is Materialize’s default envelope (i.e. the envelope used if none is specified), and it can be requested explicitly with ENVELOPE NONE.
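Spelled out explicitly, the append-only envelope in a source definition looks like this (the bucket name is a placeholder):

```sql
CREATE SOURCE events
FROM S3 DISCOVER OBJECTS USING BUCKET SCAN 'my-events-bucket'
WITH (region = 'us-east-2')
FORMAT TEXT
ENVELOPE NONE;
```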


Example

Assuming there is an S3 bucket “analytics” that contains the following keys and associated content:


{"user_id": 9999, "disks_used": 2, "cpu_used_minutes": 3600}
{"user_id": 888, "disks_used": 0}
{"user_id": 777, "disks_used": 25, "cpu_used_minutes": 9200}


{"user_id": 9999, "disks_used": 150, "cpu_used_minutes": 400100}
{"user_id": 888, "disks_used": 0}
{"user_id": 777, "disks_used": 12, "cpu_used_minutes": 999900}

We can load all these keys with the following command:

> CREATE SOURCE json_source
  FROM S3 DISCOVER OBJECTS MATCHING '**/*.json' USING BUCKET SCAN 'analytics'
  WITH (region = 'us-east-2')
  FORMAT TEXT;

This creates a source that…

To access the data as JSON we can use standard JSON functions and operators:

  SELECT
    data->>'user_id' AS user_id,
    data->>'disks_used' AS disks_used,
    data->>'cpu_used_minutes' AS cpu_used_minutes
  FROM (SELECT text::JSONB AS data FROM json_source) AS jsonified;

