CREATE SOURCE: Local files
CREATE SOURCE
connects Materialize to an external system you want to read data from, and provides details about how to decode and interpret that data.
This page describes how to use Materialize to read from local files.
Syntax
format_spec
compression
key_constraint
with_options
Field | Use |
---|---|
MATERIALIZED | Materializes the source’s data, which retains all data in memory and makes sources directly selectable. For more information, see Key Concepts: Materialized sources. |
src_name | The name for the source, which is used as its table name within SQL. |
col_name | Override default column name with the provided identifier. If used, a col_name must be provided for each column in the created source. |
FILE path | The absolute path to the file you want to use as the source. |
COMPRESSION NONE | (Default) The file does not use a compression algorithm. |
COMPRESSION algorithm | The compression algorithm used with the file. Using GZIP compression requires the file to be a normal .gz file (e.g. one produced by gzip ) or a concatenation of multiple gzip member streams. |
ENVELOPE NONE | (Default) Use an append-only envelope. This means that records will only be appended and cannot be updated or deleted. |
WITH ( option_list ) | Options affecting source creation. For more detail, see WITH options. |
WITH
options
Field | Value | Description |
---|---|---|
tail |
boolean |
Continually check the file for new content. |
timestamp_frequency_ms |
int |
Default: 1000 . Sets the timestamping frequency in ms . Reflects how frequently the source advances its timestamp. This measure reflects how stale data in views will be. Lower values result in more-up-to-date views but may reduce throughput. |
Supported formats
Format |
Append-only envelope | Upsert envelope | Debezium envelope |
---|---|---|---|
Avro | ✓ | ✓ | |
JSON | ✓ | ||
Text/bytes | ✓ | ||
CSV | ✓ |
Features
Tailing files
In addition to reading static files and making the data available for processing downstream, Materialize can continually check a file for new lines and ingest them as they are appended. For a source to implement this behavior, you must configure it to use WITH (tail = true)
:
CREATE SOURCE csv_source
FROM FILE '/local/path/file.csv'
WITH (tail = true)
FORMAT CSV WITH HEADER;
Formatting text
If you’re dealing with unstructured text files, such as server logs, you can structure the content by providing a regular expression using the REGEX
format specifier. This lets you generate multiple columns from arbitrary lines of text,
given it has some consistent formatting:
CREATE SOURCE hex
FROM FILE '/xxd.log'
WITH (tail = true)
FORMAT REGEX '(?P<offset>[0-9a-f]{8}): (?:[0-9a-f]{4} ){8} (?P<decoded>.*)$';
It’s important to note that Materialize uses the Rust regex dialect to parse regex strings, which is similar but not identical to the PostgreSQL regex dialect. For details on the supported syntax, refer to the regex crate documentation.
Examples
Creating a source
To read from a local Avro Object Container File (OCF):
CREATE SOURCE avro_source
FROM AVRO OCF '/local/path/file.ocf'
WITH (tail = true);
This creates a source that…
- Automatically determines its schema from the OCF file’s embedded schema.
- Dynamically checks for new entries.
- Is append-only.
To read from a local JSON-formatted file:
CREATE SOURCE json_source
FROM FILE '/local/path/file.json'
FORMAT BYTES;
CREATE MATERIALIZED VIEW jsonified_file_source AS
SELECT
data->>'field1' AS field_1,
data->>'field2' AS field_2,
data->>'field3' AS field_3
FROM (SELECT CONVERT_FROM(data, 'utf8')::jsonb AS data FROM json_source);
As an example, assume we have xxd
creating hex dumps for some incoming files. Its output might look like this:
00000000: 7f45 4c46 0201 0100 0000 0000 0000 0000 .ELF............
00000010: 0300 3e00 0100 0000 105b 0000 0000 0000 ..>......[......
00000020: 4000 0000 0000 0000 7013 0200 0000 0000 @.......p.......
To create a source that takes in these entire lines and extracts the file offset, as well as the decoded value:
CREATE SOURCE hex
FROM FILE '/xxd.log'
WITH (tail = true)
FORMAT REGEX '(?P<offset>[0-9a-f]{8}): (?:[0-9a-f]{4} ){8} (?P<decoded>.*)$';
This creates a source that…
- Is append-only.
- Has two columns:
offset
anddecoded
. - Discards the second group (i.e.,
(?:[0-9a-f]{4} ){8}
). - Dynamically checks for new entries.
Using the above example, the source would generate data that looks similar to this:
offset | decoded
---------+------------------
00000000 | .ELF............
00000010 | ..>......[......
00000020 | @.......p.......
Static files
CREATE SOURCE static_w_header
FROM FILE '/local/path/file.csv'
FORMAT CSV WITH HEADER;
This creates a source that…
- Is append-only.
- Materialize reads once during source creation.
- Has as many columns as the first row of the CSV file, and uses that row to name the columns.
Dynamic files
CREATE SOURCE dynamic_wo_header (col_foo, col_bar, col_baz)
FROM FILE '/local/path/file.csv'
WITH (tail = true)
FORMAT CSV WITH 3 COLUMNS;
This creates a source that…
- Is append-only.
- Has 3 columns (
col_foo
,col_bar
,col_baz
). Materialize will not ingest any row without 3 columns. - Materialize dynamically checks for new entries.