Materialize Logo

CREATE SOURCE: Text or bytes from local file

CREATE SOURCE connects Materialize to some data source, and lets you interact with its data as if it were in a SQL table.

This document details how to connect Materialize to text- or byte–formatted local files.

Conceptual framework

Sources represent connections to resources outside Materialize that it can read data from. For more information, see API Components: Sources.

Formatting text

If you have unstructured text files, such as server logs, you can impose a structure on them by providing a regular expression in the REGEX formatting option. This lets you generate multiple columns from arbitrary lines of text, given the text has some consistent structure.

If you want all of the text to come in as a single column, you can simply choose the TEXT formatting option.

Syntax

CREATE MATERIALIZED SOURCE IF NOT EXISTS src_name ( col_name , ) FROM FILE path COMPRESSION NONE GZIP WITH ( field = val , ) FORMAT REGEX regex TEXT BYTES
Field Use
MATERIALIZED Materializes the source’s data, which retains all data in memory and makes sources directly selectable. For more information, see Materialized source details.
src_name The name for the source, which is used as its table name within SQL.
col_name Override default column name with the provided identifier. If used, a col_name must be provided for each column in the created source.
FILE path The absolute path to the file you want to use as the source.
COMPRESSION algorithm The compression algorithm used with this file.
WITH ( option_list ) Options affecting source creation. For more detail, see WITH options.
REGEX regex Format the source’s data as a string, applying regex, whose capture groups define the columns of the relation. For more detail, see Regex format details.
TEXT Format the source’s data as ASCII-encoded text.
BYTES Format the source’s data as unformatted bytes.
ENVELOPE NONE (Default) Use an append-only envelope. This means that records will only be appended and cannot be updated or deleted.

WITH options

The following options are valid within the WITH clause.

Field Value type Description
tail boolean Continually check the file for new content.

Details

Materialized source details

Materializing a source keeps data it receives in an in-memory index, the presence of which makes the source directly queryable. In contrast, non-materialized sources cannot process queries directly; to access the data the source receives, you need to create materialized views that SELECT from the source.

For a mental model, materializing the source is approximately equivalent to creating a non-materialized source, and then creating a materialized view from all of the source’s columns:

CREATE SOURCE src ...;
CREATE MATERIALIZED VIEW src_view AS SELECT * FROM src;

The actual implementation of materialized sources differs, though, by letting you refer to the source’s name directly in queries.

For more details about the impact of materializing sources (and implicitly creating an index), see CREATE INDEX: Details — Memory footprint.

File source details

path values must be the file’s absolute path, e.g.

CREATE SOURCE server_source FROM FILE '/Users/sean/server.log'...

Compression

Regex format details

Regex-formatted sources let you apply a structure to arbitrary strings passed in from file sources. This is particularly useful when processing unstructured log files.

Text format details

Text-formatted sources reads lines from a file.

Raw byte format details

Raw byte-formatted sources provide Materialize the raw bytes received from the source without applying any formatting or decoding.

Raw byte-formatted sources have one column, which, by default, is named bytes.

Append-only envelope

Append-only envelope means that all records received by the source is treated as an insert. This is Materialize’s default envelope (i.e. if no envelope is specified), and can be specified with ENVELOPE NONE.

Examples

Creating a source from a dynamic, unstructured file

In this example, we’ll assume we have xxd creating hex dumps for some incoming files. Its output might look like this:

00000000: 7f45 4c46 0201 0100 0000 0000 0000 0000  .ELF............
00000010: 0300 3e00 0100 0000 105b 0000 0000 0000  ..>......[......
00000020: 4000 0000 0000 0000 7013 0200 0000 0000  @.......p.......

We’ll create a source that takes in these entire lines and extracts the file offset, as well as the decoded value.

CREATE SOURCE hex
FROM FILE '/xxd.log'
WITH (tail = true)
FORMAT REGEX '(?P<offset>[0-9a-f]{8}): (?:[0-9a-f]{4} ){8} (?P<decoded>.*)$';

This creates a source that…

Using the above example, the source would generate data that looks similar to this:

 offset  |     decoded
---------+------------------
00000000 | .ELF............
00000010 | ..>......[......
00000020 | @.......p.......
Did this info help?
Yes No