Materialize Documentation

CREATE SOURCE: Local files

CREATE SOURCE connects Materialize to an external system you want to read data from, and provides details about how to decode and interpret that data.

This page describes how to use Materialize to read from local files.

Syntax

CREATE [MATERIALIZED] SOURCE [IF NOT EXISTS] src_name
  [ ( col_name [, ...] [, key_constraint] ) ]
  FROM {
      FILE path [ compression ] [ with_options ] FORMAT format_spec
    | AVRO OCF path [ with_options ]
  }
  [ ENVELOPE { NONE | DEBEZIUM } ]

format_spec

    REGEX regex
  | CSV WITH { HEADER [ ( col_name [, ...] ) ] | n COLUMNS [ DELIMITED BY char ] }
  | TEXT
  | BYTES

compression

COMPRESSION { NONE | GZIP }

key_constraint

PRIMARY KEY ( col_name [, ...] ) NOT ENFORCED
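
For instance, a key constraint is declared alongside the column list; the NOT ENFORCED clause signals that Materialize will not verify the uniqueness of the key. A minimal sketch, with a hypothetical path and column names:

CREATE SOURCE keyed_csv (id, value, PRIMARY KEY (id) NOT ENFORCED)
  FROM FILE '/local/path/keyed.csv'
  FORMAT CSV WITH 2 COLUMNS;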

with_options

WITH ( field = val [, ...] )
Field Use
MATERIALIZED Materializes the source’s data, which retains all data in memory and makes sources directly selectable. For more information, see Key Concepts: Materialized sources.
src_name The name for the source, which is used as its table name within SQL.
col_name Override default column name with the provided identifier. If used, a col_name must be provided for each column in the created source.
FILE path The absolute path to the file you want to use as the source.
COMPRESSION NONE (Default) The file does not use a compression algorithm.
COMPRESSION algorithm The compression algorithm used with the file. Using GZIP compression requires the file to be a normal .gz file (e.g. one produced by gzip) or a concatenation of multiple gzip member streams.
ENVELOPE NONE (Default) Use an append-only envelope. This means that records will only be appended and cannot be updated or deleted.
WITH ( option_list ) Options affecting source creation. For more detail, see WITH options.
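
As an illustration, several of these options can be combined. The sketch below (hypothetical path and column names) creates a materialized source over a gzip-compressed CSV file and overrides the default column names:

CREATE MATERIALIZED SOURCE IF NOT EXISTS gz_csv_source (id, name, city)
  FROM FILE '/local/path/file.csv.gz'
  COMPRESSION GZIP
  FORMAT CSV WITH 3 COLUMNS;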

WITH options

Field Value Description
tail boolean Continually check the file for new content.
timestamp_frequency_ms int Default: 1000. Sets the timestamping frequency in ms. Reflects how frequently the source advances its timestamp. This measure reflects how stale data in views will be. Lower values result in more-up-to-date views but may reduce throughput.
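
For example, a tailed source can trade freshness for throughput by raising the timestamping interval. A minimal sketch with a hypothetical path:

CREATE SOURCE infrequent_source
  FROM FILE '/local/path/file.csv'
  WITH (tail = true, timestamp_frequency_ms = 5000)
  FORMAT CSV WITH HEADER;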

Supported formats

Format       Append-only envelope   Upsert envelope   Debezium envelope
Avro
JSON
Text/bytes
CSV

Features

Tailing files

In addition to reading static files and making the data available for processing downstream, Materialize can continually check a file for new lines and ingest them as they are appended. To enable this behavior, create the source with WITH (tail = true):

CREATE SOURCE csv_source
  FROM FILE '/local/path/file.csv'
  WITH (tail = true)
  FORMAT CSV WITH HEADER;

Formatting text

If you’re dealing with unstructured text files, such as server logs, you can structure the content by providing a regular expression using the REGEX format specifier. This lets you generate multiple columns from arbitrary lines of text, provided the text has some consistent formatting:

CREATE SOURCE hex
  FROM FILE '/xxd.log'
  WITH (tail = true)
  FORMAT REGEX '(?P<offset>[0-9a-f]{8}): (?:[0-9a-f]{4} ){8} (?P<decoded>.*)$';

It’s important to note that Materialize uses the Rust regex dialect to parse regex strings, which is similar but not identical to the PostgreSQL regex dialect. For details on the supported syntax, refer to the regex crate documentation.

Examples

Creating a source

To read from a local Avro Object Container File (OCF):

CREATE SOURCE avro_source
  FROM AVRO OCF '/local/path/file.ocf'
  WITH (tail = true);

This creates a source that…

  • Automatically determines its schema from the OCF file’s embedded schema.
  • Dynamically checks for new entries.
  • Is append-only.

To read from a local JSON-formatted file:

CREATE SOURCE json_source
  FROM FILE '/local/path/file.json'
  FORMAT BYTES;
CREATE MATERIALIZED VIEW jsonified_file_source AS
  SELECT
    data->>'field1' AS field_1,
    data->>'field2' AS field_2,
    data->>'field3' AS field_3
  FROM (SELECT CONVERT_FROM(data, 'utf8')::jsonb AS data FROM json_source);
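
Assuming each line of the file is a complete JSON object with those three fields, the materialized view can then be queried like any other relation:

SELECT field_1, COUNT(*)
FROM jsonified_file_source
GROUP BY field_1;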

As an example, assume we have xxd creating hex dumps for some incoming files. Its output might look like this:

00000000: 7f45 4c46 0201 0100 0000 0000 0000 0000  .ELF............
00000010: 0300 3e00 0100 0000 105b 0000 0000 0000  ..>......[......
00000020: 4000 0000 0000 0000 7013 0200 0000 0000  @.......p.......

To create a source that takes in these entire lines and extracts the file offset, as well as the decoded value:

CREATE SOURCE hex
  FROM FILE '/xxd.log'
  WITH (tail = true)
  FORMAT REGEX '(?P<offset>[0-9a-f]{8}): (?:[0-9a-f]{4} ){8} (?P<decoded>.*)$';

This creates a source that…

  • Is append-only.
  • Has two columns: offset and decoded.
  • Discards the second group (i.e., (?:[0-9a-f]{4} ){8}).
  • Dynamically checks for new entries.

Using the above example, the source would generate data that looks similar to this:

 offset  |     decoded
---------+------------------
00000000 | .ELF............
00000010 | ..>......[......
00000020 | @.......p.......
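
Because the hex source above is not materialized, one way (a sketch, not the only option) to make its output directly selectable is to maintain it in a materialized view:

CREATE MATERIALIZED VIEW hex_lines AS
  SELECT * FROM hex;

SELECT COUNT(*) FROM hex_lines;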

Static files

CREATE SOURCE static_w_header
  FROM FILE '/local/path/file.csv'
  FORMAT CSV WITH HEADER;

This creates a source that…

  • Is append-only.
  • Is read only once, during source creation; Materialize does not check the file for new content afterward.
  • Has as many columns as the first row of the CSV file, and uses that row to name the columns.

Dynamic files

CREATE SOURCE dynamic_wo_header (col_foo, col_bar, col_baz)
  FROM FILE '/local/path/file.csv'
  WITH (tail = true)
  FORMAT CSV WITH 3 COLUMNS;

This creates a source that…

  • Is append-only.
  • Has 3 columns (col_foo, col_bar, col_baz). Materialize will not ingest any row without 3 columns.
  • Dynamically checks for new entries.
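
Downstream, a tailed source like this one can feed a materialized view that stays up to date as new rows are appended; the aggregation below is only illustrative:

CREATE MATERIALIZED VIEW foo_counts AS
  SELECT col_foo, COUNT(*) AS cnt
  FROM dynamic_wo_header
  GROUP BY col_foo;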