CREATE SOURCE: Text or bytes from an S3 bucket
CREATE SOURCE connects Materialize to an external data source and lets you interact with its data as if the data were in a SQL table.
This document details how to connect Materialize to an S3 Bucket that contains multiple objects, and to listen for new object creation. Each S3 object can contain multiple byte- or text-encoded records, separated by newlines.
Conceptual framework
Sources represent connections to resources outside Materialize that it can read data from. For more information, see API Components: Sources.
Syntax
Field | Use |
---|---|
MATERIALIZED | Materializes the source’s data, which retains all data in memory and makes sources directly selectable. For more information, see Materialized source details. |
src_name | The name for the source, which is used as its table name within SQL. |
col_name | Override default column name with the provided identifier. If used, a col_name must be provided for each column in the created source. |
DISCOVER OBJECTS | Describes how to discover keys to download. See Object discovery strategies. |
BUCKET SCAN bucket_name | Materialize will scan the bucket to find the set of objects to download. |
SQS NOTIFICATIONS queue_name | Materialize will subscribe to the specified queue and download new objects. See Listening to SQS notifications. |
MATCHING pattern | A glob-style pattern to filter objects to ingest. See Patterns. Default is to ingest all objects. |
COMPRESSION algorithm | The compression algorithm used to decode downloaded objects. See Compression. |
WITH ( option_list ) | Options affecting source creation. For more detail, see WITH options. |
REGEX regex | Format the source’s data as a string, applying regex, whose capture groups define the columns of the relation. For more detail, see Regex format details. |
TEXT | Format the source’s data as ASCII-encoded text. |
FORMAT BYTES | Leave data received from the source as unformatted bytes stored in a column named data. |
ENVELOPE NONE | (Default) Use an append-only envelope. This means that records will only be appended and cannot be updated or deleted. |
WITH options
The following options are valid within the WITH clause.
Field | Value type | Description |
---|---|---|
region | text | Required. A valid AWS region. |
AWS Credentials WITH options
Field | Value | Description |
---|---|---|
access_key_id | text | A valid access key ID for the AWS resource. |
secret_access_key | text | A valid secret access key for the AWS resource. |
token | text | The session token associated with the credentials, if the credentials are temporary. |
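As a sketch, static credentials can be supplied alongside the region in the WITH clause; the source name, bucket, pattern, and all key values below are illustrative placeholders, not working credentials:

CREATE SOURCE credentialed_logs
FROM S3 DISCOVER OBJECTS MATCHING '**/*.log' USING BUCKET SCAN 'example-bucket'
WITH (
    region = 'us-east-2',
    access_key_id = 'EXAMPLEACCESSKEYID',
    secret_access_key = 'example-secret-access-key',
    token = 'example-session-token'
)
FORMAT TEXT;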
If you do not provide credentials via WITH options, then materialized will examine the standard AWS authorization chain:

- Environment variables: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
- credential_process command in the AWS config file, usually located at ~/.aws/config.
- AWS credentials file, usually located at ~/.aws/credentials.
- IAM instance profile. Will only work if running on an EC2 instance with an instance profile/role.
Credentials fetched from a container or instance profile expire on a fixed schedule. Materialize will attempt to refresh the credentials automatically before they expire, but the source will become inoperable if the refresh operation fails.
Permissions Required
The IAM User or Role used by materialized requires permission to perform different AWS actions depending on which actions are required to discover the list of objects to download.

The DISCOVER OBJECTS USING clause describes how Materialize will load objects, and so its parameters determine which permissions materialized requires. For example, since the SCAN key name source (as in DISCOVER OBJECTS USING BUCKET SCAN) must perform repeated ListObjects actions to create a list of key names to download, you must grant the Materialize IAM User or Role the ListObjects permission before you specify DISCOVER OBJECTS USING BUCKET SCAN.
Key name source | Permissions required |
---|---|
All | GetObject permission for the objects to be downloaded |
BUCKET SCAN | ListObject permission for the buckets to be scanned, unless the MATCHING pattern can only match a single object. In such cases, Materialize will perform only the necessary GetObject API call. |
SQS NOTIFICATIONS | GetMessage, GetQueueUrl SQS permissions for the queue Materialize will listen to |
The root AWS documentation for S3 permissions is available here.
Details
Materialized source details
Materializing a source keeps data it receives in an in-memory index, the presence of which makes the source directly queryable. In contrast, non-materialized sources cannot process queries directly; to access the data the source receives, you need to create materialized views that SELECT from the source.
For a mental model, materializing the source is approximately equivalent to creating a non-materialized source, and then creating a materialized view from all of the source’s columns:
CREATE SOURCE src ...;
CREATE MATERIALIZED VIEW src_view AS SELECT * FROM src;
The actual implementation of materialized sources differs, though, by letting you refer to the source’s name directly in queries.
For more details about the impact of materializing sources (and implicitly creating an index), see CREATE INDEX: Details — Memory footprint.
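To make the difference concrete, here is a minimal sketch using the names from the statements above; whether src can be queried by name depends on whether it was created as a materialized source:

-- If src had instead been created with CREATE MATERIALIZED SOURCE,
-- it could be queried by name directly:
SELECT count(*) FROM src;

-- With the non-materialized src above, query the materialized view instead:
SELECT count(*) FROM src_view;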
S3 source details
The S3 source is designed to ingest a large volume of static data from Amazon Web Services' Simple Storage Service (S3), a cloud object store. An object store behaves differently from a file system, and those differences affect how Materialize interacts with objects in it.
Some key differences from file systems include:
- Latency on individual S3 operations is much higher than on even cloud-network filesystems.
- It is not possible to seek to a point in an object; Materialize must download the entire object from the beginning to read to a point.
- All object operations are atomic. It is not possible to modify just part of an S3 object; you can only entirely replace or delete objects.
The primary effect of this is that we do not handle updates to S3 objects, and we may interleave the ingestion of multiple objects to speed it up.
S3 limitations
- Currently, S3 sources do not support Avro- or Protobuf-encoded objects. Implementation issues: Avro, Protobuf
- Object ingest order is not guaranteed.
- All S3 sources are append-only. Deleted and updated S3 objects are silently ignored.
Object discovery strategies
Materialize has several techniques to discover the objects to download, with more planned for the future.
All strategies follow the same basic pattern. Materialize will:
- Obtain a list of objects using AWS APIs.
- Deduplicate objects so the same object is never downloaded twice.
- Filter the list of objects against the MATCHING clause patterns.
- Download the matching objects.
- Treat each object downloaded as a newline-delimited file for the purposes of record delineation.
No guarantees are made about the relative order of records sent from the object discovery process to the Materialize SQL engine. As usual with SQL, you must impose your own order.
You may specify strategies multiple times within one CREATE SOURCE statement. For example, this is a legal invocation:
DISCOVER OBJECTS USING
BUCKET SCAN 'example',
BUCKET SCAN 'other',
SQS NOTIFICATIONS 'example-notifications'
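Embedded in a complete statement, that clause might look like the following sketch; the bucket and queue names carry over from the fragment above, while the source name, region, and FORMAT TEXT choice are illustrative assumptions:

CREATE SOURCE multi_discovery
FROM S3 DISCOVER OBJECTS USING
    BUCKET SCAN 'example',
    BUCKET SCAN 'other',
    SQS NOTIFICATIONS 'example-notifications'
WITH (region = 'us-east-2')
FORMAT TEXT;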
Listening to SQS notifications
AWS S3 has a built-in method for notifying downstream applications of bucket modification, the Event Notification API. For Materialize, the only interesting modifications are object creation, aka the s3:ObjectCreated:* event namespace. Follow the AWS Tutorial to configure a bucket for exactly this namespace.
Once you have configured S3 notifications to go to an SQS queue, you can point Materialize at that queue with DISCOVER OBJECTS USING SQS NOTIFICATIONS 'queue-name'.
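For example, a source that ingests objects only as notifications arrive might look like this sketch; the source name, queue name, region, and FORMAT TEXT are placeholders for your own configuration:

CREATE SOURCE notified_logs
FROM S3 DISCOVER OBJECTS USING SQS NOTIFICATIONS 'example-notifications'
WITH (region = 'us-east-2')
FORMAT TEXT;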
Patterns
It is possible to filter the list of object keys to download using unix-style glob syntax as an argument to the MATCHING clause:

- ? matches any single character except /.
- * matches zero or more characters except /.
- ** recursively matches directories, but some other pattern must be specified. For example, a/** matches anything inside of the a/ prefix (but not a/ itself), and **/a matches a in any prefix, but not a with no prefix.
- {a,b} matches a or b, where a and b are arbitrary glob patterns.
- [ab] matches a or b, where a and b are characters. Prepend ! to the matched characters to invert the match, e.g. [!ab] matches any character besides a or b.
- You can escape metacharacters such as * and ? with character class notation. For example, [*] matches *.
Pattern examples
Pattern | Example Matches | Example Excludes |
---|---|---|
** | a, a/b/c.json | none |
2020/**/*.json | 2020/11/uuid.json | data/2020/uuid.json, 2020/11/uuid.csv |
* | a | a/b |
202{0,1}/*/*.csv | 2020/11/data.csv | 2022/11/data.csv, 2020/11/01/data.csv |
Compression
- Omitting COMPRESSION is equivalent to COMPRESSION NONE.
- Using GZIP compression requires the object be compressed using the gzip algorithm or that it be a concatenation of multiple gzip member streams (see the sketch below).
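As a sketch, a source reading gzip-compressed log objects might look like this; the source name, bucket, and pattern are placeholders, and COMPRESSION is placed between the discovery clause and WITH, following the syntax ordering above:

CREATE SOURCE gzipped_logs
FROM S3 DISCOVER OBJECTS MATCHING '**/*.log.gz' USING BUCKET SCAN 'example-bucket'
COMPRESSION GZIP
WITH (region = 'us-east-2')
FORMAT TEXT;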
Regex format details
Regex-formatted sources let you apply a structure to arbitrary strings passed in from S3 objects. This is particularly useful when processing unstructured log files.
- To parse regex strings, Materialize uses rust-lang/regex. For more detail, refer to its documented syntax.
- To create a column in the source, create a capture group, i.e. a parenthesized expression, e.g. ([0-9a-f]{8}). Name columns by:
  - Using the col_name option when creating the source. The number of names provided must match the number of capture groups.
  - Creating named capture groups, e.g. ?P<offset> in (?P<offset>[0-9a-f]{8}) creates a column named offset (see the sketch after this list). Unnamed capture groups are named column1, column2, etc.
- We discard all data not included in a capture group. You can create non-capturing groups using ?: as the leading pattern in the group, e.g. (?:[0-9a-f]{4} ){8}.
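A minimal sketch combining a named capture group with an unnamed one; the source name, bucket, pattern, and the regex itself are illustrative assumptions:

CREATE SOURCE hex_offset_lines
FROM S3 DISCOVER OBJECTS MATCHING '**/*.log' USING BUCKET SCAN 'example-bucket'
WITH (region = 'us-east-2')
FORMAT REGEX '(?P<offset>[0-9a-f]{8}) (.*)';

Here the first capture group becomes a column named offset, while the unnamed second group falls back to the default columnN naming described above.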
Text format details
Text-formatted sources read lines from each downloaded object.

- Data from text-formatted sources is treated as newline-delimited.
- Data is assumed to be UTF-8 encoded, and discarded if it cannot be converted to UTF-8.
- Text-formatted sources have one column, which, by default, is named text.
Raw byte format details
Raw byte-formatted sources provide Materialize the raw bytes received from the source without applying any formatting or decoding.
Raw byte-formatted sources have one column, which, by default, is named data.
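A minimal sketch of a raw byte source; the source name, bucket, and pattern are placeholders, and the ENVELOPE NONE clause simply spells out the default append-only envelope described next:

CREATE SOURCE raw_payloads
FROM S3 DISCOVER OBJECTS MATCHING '**/*.bin' USING BUCKET SCAN 'example-bucket'
WITH (region = 'us-east-2')
FORMAT BYTES
ENVELOPE NONE;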
Append-only envelope
Append-only envelope means that all records received by the source are treated as inserts. This is Materialize's default envelope (i.e. if no envelope is specified), and it can be specified explicitly with ENVELOPE NONE.
Examples
Assuming there is an S3 bucket named “frontend” that contains the following keys and associated content:
logs/2020/12/31/frontend.log
99.99.44.44 - - [12/31/2020:23:55:59 +0000] "GET /updates HTTP/1.1" 200 10020 "-" "Python/Requests 22"
80.91.33.133 - - [12/31/2020:23:55:02 +0000] "GET /downloads/materialized HTTP/1.1" 304 0 "-" "Rust/reqwest 0.3"
173.203.139.108 - - [12/31/2020:23:55:07 +0000] "GET /wpadmin HTTP/1.1" 404 332 "-" "Firefox 9000"
173.203.139.108 - - [12/31/2020:23:55:14 +0000] "GET /downloads/materialized HTTP/1.1" 404 334 "-" "Python/Requests 22"
99.99.44.44 - - [12/31/2020:23:55:01 +0000] "GET /downloads/materialized HTTP/1.1" 304 0 "-" "Python/Requests 22"
80.91.33.133 - - [12/31/2020:23:55:41 +0000] "GET /downloads/materialized HTTP/1.1" 304 0 "-" "Rust/reqwest 0.3"
37.26.93.214 - - [12/31/2020:23:55:52 +0000] "GET /updates HTTP/1.1" 200 3318 "-" "Go 1.1 package http"
logs/2021/01/01/frontend.log
99.99.44.44 - - [01/01/2021:00:00:41 +0000] "GET /updates HTTP/1.1" 200 10020 "-" "Python/Requests 22"
188.138.60.101 - - [01/01/2021:00:00:48 +0000] "GET /updates HTTP/1.1" 200 10020 "-" "Python/Requests 22"
46.50.21.83 - - [01/01/2021:00:00:02 +0000] "GET /downloads/materialized HTTP/1.1" 304 0 "-" "Python/Requests 22.01"
99.99.44.44 - - [01/01/2021:00:00:25 +0000] "GET /downloads/materialized HTTP/1.1" 304 0 "-" "Python/Requests 22"
91.239.186.133 - - [01/01/2021:00:00:04 +0000] "GET /updates HTTP/1.1" 200 10020 "-" "Python/Requests 22"
173.203.139.108 - - [01/01/2021:00:00:08 +0000] "GET /downloads/materialized HTTP/1.1" 304 0 "-" "Python/Requests 22"
80.91.33.133 - - [01/01/2021:00:00:04 +0000] "GET /downloads/materialized HTTP/1.1" 304 0 "-" "Rust/reqwest 0.3"
93.190.71.150 - - [01/01/2021:00:00:33 +0000] "GET /updates HTTP/1.1" 200 10020 "-" "Python/Requests 22"
91.234.194.89 - - [01/01/2021:00:00:57 +0000] "GET /updates HTTP/1.1" 200 10020 "-" "Python/Requests 22"
46.4.83.163 - - [01/01/2021:00:00:20 +0000] "GET /downloads/materialized HTTP/1.1" 304 0 "-" "Python/Requests 22"
173.203.139.108 - - [01/01/2021:00:00:39 +0000] "GET /downloads/materialized HTTP/1.1" 404 335 "-" "Python/Requests 22"
Example TEXT format
First we’ll create a source that just ingests all these logs, so that we can execute some quick and dirty analysis:
CREATE MATERIALIZED SOURCE frontend_logs
FROM S3 DISCOVER OBJECTS MATCHING 'logs/202?/**/*.log' USING BUCKET SCAN 'frontend'
WITH (region = 'us-east-2')
FORMAT TEXT;
With that, we will have one SQL row per line in both files, so we can, for example, query for all the lines that match updates, ordered by the position at which we encountered each line:
> SELECT mz_record, text FROM frontend_logs WHERE text LIKE '%updates%' ORDER BY mz_record;
mz_record | text
1 | 99.99.44.44 - - [12/31/2020:23:55:59] "GET /updates HTTP/1.1" 200 10020 "-" "Python/Requests_22"
7 | 37.26.93.214 - - [12/31/2020:23:55:52] "GET /updates HTTP/1.1" 200 3318 "-" "Go_1.1_package_http"
8 | 99.99.44.44 - - [01/01/2021:00:00:41] "GET /updates HTTP/1.1" 200 10020 "-" "Python/Requests_22"
9 | 188.138.60.101 - - [01/01/2021:00:00:48] "GET /updates HTTP/1.1" 200 10020 "-" "Python/Requests_22"
12 | 91.239.186.133 - - [01/01/2021:00:00:04] "GET /updates HTTP/1.1" 200 10020 "-" "Python/Requests_22"
15 | 93.190.71.150 - - [01/01/2021:00:00:33] "GET /updates HTTP/1.1" 200 10020 "-" "Python/Requests_22"
16 | 91.234.194.89 - - [01/01/2021:00:00:57] "GET /updates HTTP/1.1" 200 10020 "-" "Python/Requests_22"
Example REGEX format
It’s generally more convenient to work with well-structured columnar data, though, so we can use the REGEX format specifier to extract useful data:
CREATE MATERIALIZED SOURCE frontend_logs
FROM S3 DISCOVER OBJECTS MATCHING 'logs/202?/**/*.log' USING BUCKET SCAN 'frontend'
WITH (region = 'us-east-2')
FORMAT REGEX '(?P<ip>[^ ]+) - - \[(?P<dt>[^]]+)\] "(?P<method>\w+) (?P<path>[^ ]+)[^"]+" (?P<status>\d+) (?P<content_length>\d+) "-" "(?P<user_agent>[^"]+)"';
With that, we will have one SQL row per line in both files, so we can, for example, query for all the lines that have /updates as their exact path:
> SELECT dt, ip, user_agent FROM frontend_logs WHERE path = '/updates';
dt | ip | user_agent
01/01/2021:00:00:04 | 91.239.186.133 | Python/Requests 22
01/01/2021:00:00:33 | 93.190.71.150 | Python/Requests 22
01/01/2021:00:00:41 | 99.99.44.44 | Python/Requests 22
01/01/2021:00:00:48 | 188.138.60.101 | Python/Requests 22
01/01/2021:00:00:57 | 91.234.194.89 | Python/Requests 22
12/31/2020:23:55:52 | 37.26.93.214 | Go 1.1 package http
12/31/2020:23:55:59 | 99.99.44.44 | Python/Requests 22
Related Pages
- S3 with CSV/JSON encoded data
- CREATE SOURCE
- CREATE VIEW
- SELECT