Twitter was, for a period, a really fascinating place to watch authentic human interaction unfold. You could crawl all the tweets, query a few APIs, or drink from the firehose. You can track trending hashtags, exploding viral content, and engaging conversations, all as it happens. Twitter eventually put the firehose behind a paywall, and .. it also stopped being Twitter.

One of the recent upsides is that relatively more folks are now on Bluesky, which does have a public firehose.

Today we’re going to pull data from the Bluesky firehose into Materialize. And great news, you can follow along by grabbing the recently released Community Edition. The link will still work by the end of the post, so if it sounds interesting enough at that point, have a click.

Let me stress now, having done this a few times: it is so much more compelling to see the data change than to look at results that I have copy/pasted. You’ll want to go and see the data changing live yourself! We’ll have some representative outputs, but one of the coolest things about live data is that it changes, and you’ll uncover a different narrative.


Dear reader: Social data is weird! Humans are weird! I’m intentionally capturing data as it is, unfiltered, but it means you might click on something and have that weirdness piped directly into your brain. This can be great, but know that I haven’t moderated the content.

Ingesting Bluesky into Materialize

The first thing we’ll need to do is get data flowing into Materialize. Most of this will be Javascript reading from Bluesky, but we’ll want to start by creating a home for the information in MZ.

sql
CREATE SOURCE bsky 
FROM WEBHOOK
BODY FORMAT JSON ARRAY;

If and when you type this it will provide a url back to you. Write that down; you’ll need it later!

The intent is to copy content from Bluesky into this table. I’m going to do this with some Javascript. I don’t really know Javascript, but I’m hoping that by revealing this and what I’ve written, folks may teach me something. So, not the best Javascript, but it will get data from Bluesky into Materialize.

In fact I’m going to use Bluesky’s Jetstream, which is a digested form of the firehose, minus the Merkle tree authenticating structure, and presented in JSON rather than CBOR. There are several publicly available replicas of that one can simply connect a web socket to, and read out of. Let’s start with that: create and connect a web socket to a Jetstream instance.

javascript
const bsky_uri = 'wss://jetstream2.us-east.bsky.network/subscribe?wantedCollections=app.bsky.feed.*';

const bsky_socket = new WebSocket(bsky_uri);
bsky_socket.onopen = () => { 
    console.log('Connected to bsky WebSocket');
}
bsky_socket.onerror = (error) => {
    console.error('Bluesky WebSocket error:', error);
}

Next we’re going to pump the data at Materialize through our webhook source. We’ll need that URL from the CREATE SOURCE command.

javascript
const mz_url = '<FROM CREATE SOURCE COMMAND>';

// We enqueue pending data as long as we have an outstanding insert to Materialize.
// These variables track the state of pending work (whether we are waiting, and with what data).
let awaiting_mz = false;
let pending_data = [];

bsky_socket.onmessage = async (msg) => {
    pending_data.push(JSON.parse(msg.data));
    if (!awaiting_mz) {
        let to_send = pending_data;
        pending_data = [];
        awaiting_mz = true;
        await postToWebhook(mz_url, to_send);
    }
}

The last bit is posting, which is mostly the work of Copilot. A coworker recomended the try/finally stuff, which I’ve totally not thought about in all the years of Rust I’ve been writing.

javascript
async function postToWebhook(url, data) {
    try {
        const response = await fetch(url, {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify(data),
        });
    }
    catch (e) {
        console.log(e)  
    }
    finally {
        awaiting_mz = false;
    }
}

That’s it! I will not be taking any questions on whether this is correct, how the concurrency works out, or whether I am open to recanting my computer science degree. Channeling this new AI era, it runs and does some stuff and doesn’t print errors, so it probably works!

If you put the above Javascript in a file foo.js, you can then node foo.js and it should start importing sweet social content for you.

Working with Bluesky in SQL

Let’s hop over to the Materialize side. I’m using our Console, but you should also be able to psql in if you prefer that.

If you haven’t already, create that webhook source we need.

sql
CREATE SOURCE bsky 
FROM WEBHOOK
BODY FORMAT JSON ARRAY;

This is the same command as above, I’m just reminding you about it in case you didn’t do it.

We can start with a quick query just to see that the data are flowing.

sql
SELECT COUNT(*) FROM bsky;

You can either run the query a few times to see the counts increasing (I hope), or you can SUBSCRIBE to a changefeed of the results (each time the count changes, from what old to what new values). I watched this for a little while, because it was fascinating to see it finally work, but ideally it just works for you and your interest may soon wane. Let’s move on!

With this data source in place, let’s try and tease out some of the JSON structure.

Jetstream’s JSON Structure

I have no idea what the fields are in the JSON we get out of Jetstream. I’m sure we can read about it together, but instead let’s just use SQL to directly look at the structure instead. JSON objects are a list of keys and values, where the values can be JSON objects themselves. We’ll iteratively unpack the keys for objects, descending into values that are objects, to spell out the paths of keys that exist.

sql
-- Unpack all JSON key paths in the input data.
WITH MUTUALLY RECURSIVE
-- List of text keys, followed by the remaining JSON.
paths (path TEXT LIST, blob JSONB) AS (
    SELECT LIST[]::TEXT LIST, body FROM bsky
    UNION ALL
    SELECT paths.path || jok.key, jok.value
    FROM paths, jsonb_each(blob) jok
)
-- Count the paths and order lexicographically.
SELECT
    COUNT(*),
    repeat('  ', list_length(path)) || path[list_length(path)] key
FROM paths
GROUP BY path
ORDER BY path;

This query recursively unpacks all paths down JSON objects, recording the keys in a text list. It does miss some things, like JSON arrays, but you could add them in if you understand what they imply (I do not, for these records). The output I’m seeing looks like so:

text
| count | key                         |
| ----- | --------------------------- |
| 9303  | null                        |
| 11    |   account                   |
| 11    |     active                  |
| 11    |     did                     |
| 11    |     seq                     |
| 3     |     status                  |
| 11    |     time                    |
| 9283  |   commit                    |
| 9131  |     cid                     |
| 9283  |     collection              |
| 9283  |     operation               |
| 9131  |     record                  |
| 9131  |       $type                 |
| 18    |       allow                 |
| 9131  |       createdAt             |

  ...

| 9     |     seq                     |
| 9     |     time                    |
| 9303  |   kind                      |
| 9303  |   time_us                   |

The null is the root, and each of the indented keys are path continuations from the parent. So, for example blob->'commit'->'record'->'createdAt' exists for 9,131 records; most of them.

With this information in hard, we can start to create some views, and pick out the meaning of each of these entries.

Building Views over Bluesky JSON

The JSON we are seeing come from several different “collections”, which I think correspond to semantically distinct piles of data. We are going to look at posts and replies, both of which are from the 'app.bsky.feed.post' collection. The distinction between a post and a reply seems to be whether blob->'commit'->'record'->'reply' exists (reply) or not (post). To be totally clear, I could be utterly wrong about this, and the nice thing is that we just need to change the SQL if so.

sql
CREATE VIEW posts AS
SELECT *
FROM bsky
WHERE body->'commit'->>'collection' = 'app.bsky.feed.post'
  AND body->'commit'->'record'->'reply' IS NULL;
sql
CREATE VIEW replies AS
SELECT *
FROM bsky
WHERE body->'commit'->>'collection' = 'app.bsky.feed.post'
  AND body->'commit'->'record'->'reply' IS NOT NULL;

Let’s also add a view that tells us how to connect posts with their corresponding URLs. This is a matter of popping out a few known fields, and reassembling them in a different order. The URL scheme is

https://bsky.app/profile//post/

from which we write the SQL

sql
CREATE VIEW urls AS 
SELECT
    body->'commit'->>'cid' AS cid,
    'https://bsky.app/profile/' || (body->>'did') || '/post/' || (body->'commit'->>'rkey') AS url
FROM bsky
WHERE body->'commit'->>'collection' = 'app.bsky.feed.post'
  AND body->'commit'->>'cid' IS NOT NULL;

If you want to grab #hashtags, you can extract those thusly. We won’t be using them in this post, but they are a great way to quickly get a take on what’s doing in the socials.

sql
CREATE VIEW tags AS
WITH 
facets as (
    SELECT 
        body,
        jsonb_array_elements(body->'commit'->'record'->'facets') AS facet
    FROM bsky
),
features AS (
    SELECT
        body,
        jsonb_array_elements(facet->'features') AS feature
    FROM facets
),
tags AS (
    SELECT body, feature->>'tag' AS tag
    FROM features 
    WHERE feature->>'tag' IS NOT NULL
)
select * from tags;

These relied on JSON arrays, which we didn’t explore in the section above. Sorry about that!

Timely information

We’re going to look at Bluesky data, but we might be more interested in recent data. Which hashtags are popular in posts and replies now rather than over the past day, or week, or year.

Materialize allows you to use time in WHERE filters, which automatically ages data out when the time passes the limit of the filter. You do this by using mz_now(), the function that gives you Materialize’s view of the current time, and some inequalities in a WHERE, like so:

sql
CREATE VIEW recent_posts AS
SELECT * 
FROM posts
WHERE mz_now() < ((body->>'time_us')::bigint/1000 + 300000);
sql
CREATE VIEW recent_replies AS
SELECT * 
FROM replies
WHERE mz_now() < ((body->>'time_us')::bigint/1000 + 300000);

These both now reflect all posts and replies within the past five minutes. Importantly, this is a live view over the past five minutes, rather than a snapshot emitted once every five minutes.

You can subscribe to both of these, and you’ll see they go up and down, and generally stabilize rather than grow without bound.

| count | count |
| ----- | ----- |
| 2097  | 2365  |

You might wonder: why all the complexity to count things every minute? Surely we could take time_us and round it to “the minute”, and count the posts and replies. Something like:

sql
-- Number of posts in each minute.
SELECT ((body->>'time_us')::bigint/300000000) bucket, COUNT(*)
FROM posts
GROUP BY 1;
-- Number of replies in each minute.
SELECT ((body->>'time_us')::bigint/300000000) bucket, COUNT(*)
FROM replies
GROUP BY 1;

This produces a continually evolving count of the posts and replies associated with each minute. Each post or reply lands in exactly one bucket, and contributes to just one count. That works great for counting, and a few other statistics, but it falls down on one of the main properties of social data: that the events reference each other. Non-trivial social interactions span time, and time buckets, and we’ll need an approach other than bucketing to watch them evolve.


Let’s try and do something non-trivial with these posts and replies. Let’s reassemble conversations that happened, by stitching together each reply with the post (or reply) it is in response to.

Maintaining some Results

Before moving onward, we’re going to save ourselves some time and prompt Materialize to maintain some of the results we’ve defined.

We’ve defined quite a few views, but so far they are only views: shortcuts to explain what a query is if we issue it in the future. We’ll want to actively compute and maintain these results, to save ourselves the time of having Materialize reconsider all of the data with each query we perform.

We’re going to use two collections repeatedly, both of which are reductions of the input data: recent_replies is a subset of replies within the past minute, and urls reflect all of the data but only the cid and url.

sql
CREATE DEFAULT INDEX ON recent_replies;
CREATE DEFAULT INDEX ON urls;

With these in hand, our interactive exploration will go much faster!

Tracking Active Posts

To start with, we can parse out some information from replies. Here we’ll grab the commit ids (I think that is what cid means) of the reply, its immediate parent, and the root of the conversation (present in the record).

sql
-- Useful data for each reply.
CREATE VIEW replies_parsed AS
SELECT body->'commit'->>'cid' AS reply_cid,
       body->'commit'->'record'->'reply'->'parent'->>'cid' AS parent_cid,
       body->'commit'->'record'->'reply'->'root'->>'cid' AS root_cid,
       body->>'did' AS author
FROM recent_replies;

We can now look for the most active posts, as a function of the number of replies:

sql
SELECT COUNT(*), root_cid
FROM replies_parsed
GROUP BY root_cid
ORDER BY count DESC
LIMIT 10;

which gives us

| count | root_cid                                                    |
| ----- | ----------------------------------------------------------- |
| 76    | bafyreifvguvy33rknoumvg4z77vaxjyux4abe4w7ijrjojennr2vln6pwy |
| 57    | bafyreih6xwxpvqov5bycpjwusqy57ebbloq243yw2ql4c3xst2vmfssczi |
| 30    | bafyreibjquxhr462ab6osy2744gzim4er6nttjr2wbnbvadelq6smhcuae |
| 26    | bafyreiexdchamc4eq2642noyjorubghwln7hhls5gqha6qbopg5sikn5sy |
| 25    | bafyreig5vpzkahmxs5dca5xf2nu6qmowmhcg46anxqjmtx254t676ouaxe |
| 25    | bafyreigp5j4zhbmp4h7qf5w5a24uxqqfl7g6g74dwlpunhkzrjg3vrhojy |
| 24    | bafyreihgun2h6xzu3glkyxtq5k5ycgtjtivhdfq3ql4ureuvayqhjj4uy4 |
| 23    | bafyreibryywro3pyfpvntlsaxr75obwzvwvdikqscgjowsxymmn62bsaqe |
| 22    | bafyreifbaxat6d67pqp5ye5kejmnbpvrysftxl2ppt543dbztecmc4efiy |
| 20    | bafyreigrdwm6cclffwvgvjfh5hxqxvefbey3eytbbr3unyxshhajumoxw4 |

That’s .. not super clear. Fortunately, we can join with urls to get clickable links.

sql
SELECT COUNT(*), url
FROM replies_parsed, urls
WHERE root_cid = urls.cid
GROUP BY 2
ORDER BY count DESC
LIMIT 10;

These are now clickable links, continually updated, reporting the hottest posts.

| count | url                                                                          |
| ----- | ---------------------------------------------------------------------------- |
| 76    | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcsbzj562a |
| 57    | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcptqe742u |
| 30    | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfd24mubp2a |
| 25    | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcycunl72q |
| 25    | https://bsky.app/profile/did:plc:h2ataqkfhejqhlkmmdknshxb/post/3ltfcxkr5lu2w |
| 17    | https://bsky.app/profile/did:plc:ln72v57ivz2g46uqf4xxqiuh/post/3ltfbxyfdo722 |
| 16    | https://bsky.app/profile/did:plc:2buz4gf5sew3rdwzbxsvcd4s/post/3ltfcvjkefc2g |
| 16    | https://bsky.app/profile/did:plc:af57yox3yialvrqahze5pfpe/post/3ltfch4swe225 |
| 14    | https://bsky.app/profile/did:plc:t4x2ruk2qmob2b2cx55h4v7r/post/3ltf25nctt22g |
| 13    | https://bsky.app/profile/did:plc:6vomxgshphbxhvru6q7bpist/post/3ltfcqwk67k27 |

You might notice that we don’t have all the same counts. Some of the replies are to posts not in our crawl, because I’ve restarted many things as part of the experiment.

But total replies may not be an indicator of interest, as much as activity. Let’s add a COUNT(DISTINCT author) aggregation, which additionally scores posts by the number of distinct participants, rather than their number of replies.

| participants | count | url                                                                          |
| ------------ | ----- | ---------------------------------------------------------------------------- |
| 74           | 76    | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcsbzj562a |
| 53           | 57    | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcptqe742u |
| 29           | 30    | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfd24mubp2a |
| 1            | 25    | https://bsky.app/profile/did:plc:h2ataqkfhejqhlkmmdknshxb/post/3ltfcxkr5lu2w |
| 25           | 25    | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcycunl72q |
| 16           | 17    | https://bsky.app/profile/did:plc:ln72v57ivz2g46uqf4xxqiuh/post/3ltfbxyfdo722 |
| 8            | 16    | https://bsky.app/profile/did:plc:af57yox3yialvrqahze5pfpe/post/3ltfch4swe225 |
| 16           | 16    | https://bsky.app/profile/did:plc:2buz4gf5sew3rdwzbxsvcd4s/post/3ltfcvjkefc2g |
| 14           | 14    | https://bsky.app/profile/did:plc:t4x2ruk2qmob2b2cx55h4v7r/post/3ltf25nctt22g |
| 5            | 13    | https://bsky.app/profile/did:plc:6vomxgshphbxhvru6q7bpist/post/3ltfcqwk67k27 |

We are still ordering by count, and you can see that there are a few distinct flavors here. Some posts have as many participants as responses (one-off comments), and some posts have only one participant (more of a read-out). One is clearly a low number of participants, but what about 25, or 8? Are these potentially exciting conversations, or run of the mill?

Monitoring Distributions (of Participants)

How many distinct participants do we expect to see in response to each post?

Let’s look at the distribution over the number of distinct participants, to see what levels of interaction occur across all of Bluesky.

sql
SELECT participants, COUNT(*)
FROM (
    SELECT COUNT(DISTINCT author) AS participants
    FROM replies_parsed
    GROUP BY root_cid
)
GROUP BY participants
ORDER BY count DESC;
| participants | count |
| ------------ | ----- |
| 1            | 4889  |
| 2            | 718   |
| 3            | 158   |
| 4            | 58    |
| 5            | 30    |
| 6            | 16    |
| 7            | 11    |
| 8            | 7     |
| 9            | 7     |
| 10           | 5     |
| 13           | 3     |
| 15           | 2     |
| 16           | 2     |
| 22           | 2     |
| 11           | 1     |
| 14           | 1     |
| 18           | 1     |
| 25           | 1     |
| 29           | 1     |
| 53           | 1     |
| 74           | 1     |

Even 8 distinct participants is well into the tail of the distribution, and a statistically significant level of engagement. Engagement with the root post itself, at least, although replies themselves do not necessarily a conversation make. The highest scoring posts here are most politely described as “rage bait”, and the single-reply authors suggest not much interaction happened.

Let’s dive deeper into a search for authentic communication.

Looking for Interactions

Posts with lots of responses may be easy, because folks just blurt out their takes. What does real interaction look like?

Let’s look for paths from replies up towards their root posts where authorship changes, scoring each path by the number of author alternations.

sql
CREATE VIEW conversations AS
WITH MUTUALLY RECURSIVE
    threads(
        reply_cid text,
        author text,
        responses int4
    ) AS (
        SELECT 
            reply_cid,
            author,
            MAX(responses)        
        FROM (
            SELECT reply_cid, author, 0 as responses
            FROM replies_parsed
            UNION ALL
            SELECT 
                rp.reply_cid, 
                rp.author, 
                threads.responses + CASE WHEN threads.author != rp.author THEN 1 ELSE 0 END
            FROM threads, replies_parsed rp
            WHERE threads.reply_cid = rp.parent_cid
        )
        GROUP BY 1, 2
    )
SELECT reply_cid, responses
FROM threads

As before, we’ll want to stitch these together with their URLs to make sense of them.

sql
-- Stich together conversations with their URLs.
CREATE VIEW linked_convos AS
SELECT responses, url
FROM conversations c, urls
WHERE urls.cid = c.reply_cid;

Selecting the top ten posts by their thread with the most author alternations produces

| responses | url                                                                          |
| --------- | ---------------------------------------------------------------------------- |
| 6         | https://bsky.app/profile/did:plc:4lmrw65t5zpy7iiqtot4lgsg/post/3ltfd5jgku22x |
| 5         | https://bsky.app/profile/did:plc:2ubpspaf7ujjdr2e5xprhrzu/post/3ltfd465wb22f |
| 5         | https://bsky.app/profile/did:plc:gkmyaujvjpvowpumofx4aukj/post/3ltfd55nu622d |
| 4         | https://bsky.app/profile/did:plc:fi2keywfodbhnsgbvfwliwep/post/3ltfd36igmk2e |
| 4         | https://bsky.app/profile/did:plc:hiuucknp3uehfnnujnxlvpzt/post/3ltfd3mdeh22i |
| 4         | https://bsky.app/profile/did:plc:qmhqmqfnrtmjsnkfjblfuxre/post/3ltfd2oqeqc2w |
| 4         | https://bsky.app/profile/did:plc:retp7mhsewmszkqxom2n3kpl/post/3ltfd4mzxns2q |
| 3         | https://bsky.app/profile/did:plc:3ths7aqkejjiogfvg34ubg4h/post/3ltfd3upof22p |
| 3         | https://bsky.app/profile/did:plc:gkmyaujvjpvowpumofx4aukj/post/3ltfczslepc2d |
| 3         | https://bsky.app/profile/did:plc:zeanejbbygnlc74253s5gyok/post/3ltfd45pjd22p |

These look more like real communication. The sort of real communication you might expect when you have six author alternations in five minutes. Many of them are to posts in the same thread, and we could deduplicate by the root post, or something similar.

Looking Further Afield

We’ve studied individual posts, but Bluesky (and “social” generally) are not just siloed posts. For each post, we can look for similar posts as judged by their interactions. For example, for each post which other post has the largest number of respondents in common?

sql
CREATE VIEW recommended AS
SELECT DISTINCT ON (r1) r1, r2, shared
FROM (
    SELECT 
        r1.root_cid r1, 
        r2.root_cid r2, 
        COUNT(DISTINCT r1.author) as shared
    FROM replies_parsed r1, replies_parsed r2
    WHERE r1.author = r2.author
      AND r1.root_cid != r2.root_cid
      -- Restrict our attention to recommendations we've crawled
      AND r2.root_cid IN (SELECT cid FROM urls)
    GROUP BY 1, 2
)
ORDER BY r1, shared DESC, r2;

This provides a “recommendation” from each post to some other post, based not on its content but the pattern of interaction with it. The same people who interacted with this post also interact with that post. We could even look at timestamps to see if they then go on to interact with the second post. Where do they ultimately end up?

For the moment, let’s just add some URLs to the recommendations so that we can check them out.

sql
SELECT  u1.url AS this, u2.url AS that, shared
FROM recommended, urls u1,  urls u2
WHERE r1 = u1.cid
  AND r2 = u2.cid
ORDER BY shared DESC
LIMIT 10;

The recommendations from reading this to next read that are as follows (sorry for the wall of text):

| this                                                                         | that                                                                         | shared |
| ---------------------------------------------------------------------------- | ---------------------------------------------------------------------------- | ------ |
| https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcptqe742u | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcsbzj562a | 8      |
| https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcsbzj562a | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcptqe742u | 8      |
| https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfd24mubp2a | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcsbzj562a | 3      |
| https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcycunl72q | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfd24mubp2a | 2      |
| https://bsky.app/profile/did:plc:euz3fvldom5fvdf4vp7s5bak/post/3ltfcmb2stc27 | https://bsky.app/profile/did:plc:i52jkuv7iobxqvdabckhq6ui/post/3ltfclaqj6c2l | 2      |
| https://bsky.app/profile/did:plc:i52jkuv7iobxqvdabckhq6ui/post/3ltfclaqj6c2l | https://bsky.app/profile/did:plc:euz3fvldom5fvdf4vp7s5bak/post/3ltfcmb2stc27 | 2      |
| https://bsky.app/profile/did:plc:kdphmaoqeaouax2kis27556e/post/3ltfcbd3ros26 | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcsbzj562a | 2      |
| https://bsky.app/profile/did:plc:2bfdxrptilof6qhw57koblcc/post/3ltfcqacbws2j | https://bsky.app/profile/did:plc:qcgbjlxqdwv2mcolldoljvbx/post/3ltfcr2ouck2e | 1      |
| https://bsky.app/profile/did:plc:2khndv2hbyuzwt66zona4gk5/post/3ltfcxij24k2u | https://bsky.app/profile/did:plc:aezl4prncwzcxn2daogn4ezt/post/3ltfcva4k6s2p | 1      |
| https://bsky.app/profile/did:plc:2ullnxyoicn2tzx2paseoflf/post/3ltfccjf7kk2i | https://bsky.app/profile/did:plc:5o6k7jvowuyaquloafzn3cfw/post/3ltfcdgndmc22 | 1      |

The high count recommendations are mostly back to the high volume posts, which doesn’t necessarily make them high quality recommendations. The next steps have been the study of Information Retrieval since time immemorial (any time preceding the advent of information retrieval): precision, recall, mean average precision, the Dice-Sørensen coefficient. Ideally, your careful thought can be more easily translated into action at this point, where you just need to type the SQL, and the results will be kept live and up to date.

We didn’t even crack open the likes and reposts, which are further (if quieter) signals of interest.

Wrapping up

The story told above is against the backdrop of continually changing data, but at one moment in time. Although the story evolves over time, all of the queries were executed in one Materialize transaction, and their all results tie out exactly. Constrained to the format of static text, that was the coolest trick I could pull off.

To get the real experience yourself, rather than a point in time view over some data, you should absolutely go and grab a copy of Materialize yourself. Our self-managed lets you try the real bits, and all of the analysis above fit within the bounds of the community edition.

Appendix

All of the results above are entirely consistent with each other, despite continually changing as their inputs change, because I performed them in one Materialize transaction. If you would like to do the same, having already defined the necessary views, you can copy/paste this SQL:

sql
BEGIN;

SELECT COUNT(*), root_cid
FROM replies_parsed
GROUP BY root_cid
ORDER BY count DESC
LIMIT 10;

SELECT COUNT(*), url
FROM replies_parsed, urls
WHERE root_cid = urls.cid
GROUP BY 2
ORDER BY count DESC
LIMIT 10;

SELECT COUNT(DISTINCT author) as participants, COUNT(*), url
FROM replies_parsed, urls
WHERE root_cid = urls.cid
GROUP BY 3
ORDER BY count DESC
LIMIT 10;

SELECT participants, COUNT(*)
FROM (
    SELECT COUNT(DISTINCT author) AS participants
    FROM replies_parsed
    GROUP BY root_cid
)
GROUP BY participants
ORDER BY count DESC;

SELECT * 
FROM linked_convos
ORDER BY responses DESC
LIMIT 10;

SELECT  u1.url AS this, u2.url AS that, shared
FROM recommended, urls u1,  urls u2
WHERE r1 = u1.cid
  AND r2 = u2.cid
ORDER BY shared DESC
LIMIT 10;

COMMIT;

Get Started with Materialize