Twitter was, for a period, a really fascinating place to watch authentic human interaction unfold. You could crawl all the tweets, query a few APIs, or drink from the firehose. You could track trending hashtags, exploding viral content, and engaging conversations, all as it happened. Twitter eventually put the firehose behind a paywall, and .. it also stopped being Twitter.
One of the recent upsides is that many more folks are now on Bluesky, which does have a public firehose.
Today we’re going to pull data from the Bluesky firehose into Materialize. And great news, you can follow along by grabbing the recently released Community Edition. The link will still work by the end of the post, so if it sounds interesting enough at that point, have a click.
Let me stress now, having done this a few times: it is so much more compelling to see the data change than to look at results that I have copy/pasted. You’ll want to go and see the data changing live yourself! We’ll have some representative outputs, but one of the coolest things about live data is that it changes, and you’ll uncover a different narrative.
Dear reader: Social data is weird! Humans are weird! I’m intentionally capturing data as it is, unfiltered, but it means you might click on something and have that weirdness piped directly into your brain. This can be great, but know that I haven’t moderated the content.
Ingesting Bluesky into Materialize
The first thing we’ll need to do is get data flowing into Materialize. Most of this will be JavaScript reading from Bluesky, but we’ll want to start by creating a home for the information in MZ.
CREATE SOURCE bsky
FROM WEBHOOK
BODY FORMAT JSON ARRAY;
If and when you run this, it will provide a URL back to you. Write that down; you’ll need it later!
The intent is to copy content from Bluesky into this source. I’m going to do this with some JavaScript. I don’t really know JavaScript, but I’m hoping that by revealing this and what I’ve written, folks may teach me something. So, not the best JavaScript, but it will get data from Bluesky into Materialize.
In fact I’m going to use Bluesky’s Jetstream, which is a digested form of the firehose, minus the Merkle tree authenticating structure, and presented as JSON rather than CBOR. There are several publicly available replicas that one can simply connect a WebSocket to and read from. Let’s start with that: create and connect a WebSocket to a Jetstream instance.
const bsky_uri = 'wss://jetstream2.us-east.bsky.network/subscribe?wantedCollections=app.bsky.feed.*';
const bsky_socket = new WebSocket(bsky_uri);

bsky_socket.onopen = () => {
    console.log('Connected to bsky WebSocket');
};

bsky_socket.onerror = (error) => {
    console.error('Bluesky WebSocket error:', error);
};
Next we’re going to pump the data at Materialize through our webhook source. We’ll need that URL from the CREATE SOURCE command.
const mz_url = '<FROM CREATE SOURCE COMMAND>';

// We enqueue pending data as long as we have an outstanding insert to Materialize.
// These variables track the state of pending work (whether we are waiting, and with what data).
let awaiting_mz = false;
let pending_data = [];

bsky_socket.onmessage = async (msg) => {
    pending_data.push(JSON.parse(msg.data));
    if (!awaiting_mz) {
        let to_send = pending_data;
        pending_data = [];
        awaiting_mz = true;
        await postToWebhook(mz_url, to_send);
    }
};
The last bit is posting, which is mostly the work of Copilot. A coworker recommended the try/finally stuff, which I’ve totally not thought about in all the years of Rust I’ve been writing.
async function postToWebhook(url, data) {
    try {
        const response = await fetch(url, {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify(data),
        });
        // fetch only throws on network errors, so surface non-2xx responses too.
        if (!response.ok) {
            console.error('Materialize webhook returned', response.status);
        }
    }
    catch (e) {
        console.log(e);
    }
    finally {
        awaiting_mz = false;
    }
}
That’s it! I will not be taking any questions on whether this is correct, how the concurrency works out, or whether I am open to recanting my computer science degree. Channeling this new AI era, it runs and does some stuff and doesn’t print errors, so it probably works!
If you put the above JavaScript in a file foo.js, you can then node foo.js and it should start importing sweet social content for you.
Working with Bluesky in SQL
Let’s hop over to the Materialize side. I’m using our Console, but you should also be able to psql in if you prefer that.
If you haven’t already, create that webhook source we need.
CREATE SOURCE bsky
FROM WEBHOOK
BODY FORMAT JSON ARRAY;
This is the same command as above; I’m just reminding you about it in case you haven’t run it yet.
We can start with a quick query just to see that the data are flowing.
SELECT COUNT(*) FROM bsky;
You can either run the query a few times to see the counts increasing (I hope), or you can SUBSCRIBE to a changefeed of the results (each time the count changes, from what old value to what new value).
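If you’d like to try the latter, here’s a minimal sketch; stop it whenever you’ve seen enough:

SUBSCRIBE (SELECT COUNT(*) FROM bsky);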
I watched this for a little while, because it was fascinating to see it finally work, but ideally it just works for you and your interest may soon wane.
Let’s move on!
With this data source in place, let’s try and tease out some of the JSON structure.
Jetstream’s JSON Structure
I have no idea what the fields are in the JSON we get out of Jetstream. I’m sure we could read about it together, but instead let’s just use SQL to look at the structure directly. JSON objects are a list of keys and values, where the values can be JSON objects themselves. We’ll iteratively unpack the keys for objects, descending into values that are objects, to spell out the paths of keys that exist.
-- Unpack all JSON key paths in the input data.
WITH MUTUALLY RECURSIVE
    -- List of text keys, followed by the remaining JSON.
    paths (path TEXT LIST, blob JSONB) AS (
        SELECT LIST[]::TEXT LIST, body FROM bsky
        UNION ALL
        SELECT paths.path || jok.key, jok.value
        FROM paths, jsonb_each(blob) jok
    )
-- Count the paths and order lexicographically.
SELECT
    COUNT(*),
    repeat(' ', list_length(path)) || path[list_length(path)] AS key
FROM paths
GROUP BY path
ORDER BY path;
This query recursively unpacks all paths down JSON objects, recording the keys in a text list. It does miss some things, like JSON arrays, but you could add them in if you understand what they imply (I do not, for these records). The output I’m seeing looks like so:
| count | key |
| ----- | --------------------------- |
| 9303 | null |
| 11 | account |
| 11 | active |
| 11 | did |
| 11 | seq |
| 3 | status |
| 11 | time |
| 9283 | commit |
| 9131 | cid |
| 9283 | collection |
| 9283 | operation |
| 9131 | record |
| 9131 | $type |
| 18 | allow |
| 9131 | createdAt |
...
| 9 | seq |
| 9 | time |
| 9303 | kind |
| 9303 | time_us |
The null is the root, and each of the indented keys is a path continuation from its parent. So, for example, blob->'commit'->'record'->'createdAt' exists for 9,131 records; most of them.
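For instance, you can spot-check one of those paths directly with the JSON operators; a quick peek, nothing we’ll build on:

-- Peek at a handful of createdAt values to confirm the path exists.
SELECT body->'commit'->'record'->>'createdAt' AS created_at
FROM bsky
LIMIT 5;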
With this information in hand, we can start to create some views, and pick out the meaning of each of these entries.
Building Views over Bluesky JSON
The JSON we are seeing comes from several different “collections”, which I think correspond to semantically distinct piles of data. We are going to look at posts and replies, both of which are from the 'app.bsky.feed.post' collection. The distinction between a post and a reply seems to be whether blob->'commit'->'record'->'reply' exists (reply) or not (post). To be totally clear, I could be utterly wrong about this, and the nice thing is that we just need to change the SQL if so.
CREATE VIEW posts AS
SELECT *
FROM bsky
WHERE body->'commit'->>'collection' = 'app.bsky.feed.post'
AND body->'commit'->'record'->'reply' IS NULL;
CREATE VIEW replies AS
SELECT *
FROM bsky
WHERE body->'commit'->>'collection' = 'app.bsky.feed.post'
AND body->'commit'->'record'->'reply' IS NOT NULL;
Let’s also add a view that tells us how to connect posts with their corresponding URLs. This is a matter of popping out a few known fields, and reassembling them in a different order. The URL scheme is
https://bsky.app/profile/<did>/post/<rkey>
from which we write the SQL
CREATE VIEW urls AS
SELECT
    body->'commit'->>'cid' AS cid,
    'https://bsky.app/profile/' || (body->>'did') || '/post/' || (body->'commit'->>'rkey') AS url
FROM bsky
WHERE body->'commit'->>'collection' = 'app.bsky.feed.post'
  AND body->'commit'->>'cid' IS NOT NULL;
If you want to grab #hashtags, you can extract those thusly. We won’t be using them in this post, but they are a great way to quickly get a take on what’s doing in the socials.
CREATE VIEW tags AS
WITH
    facets AS (
        SELECT
            body,
            jsonb_array_elements(body->'commit'->'record'->'facets') AS facet
        FROM bsky
    ),
    features AS (
        SELECT
            body,
            jsonb_array_elements(facet->'features') AS feature
        FROM facets
    ),
    tags AS (
        SELECT body, feature->>'tag' AS tag
        FROM features
        WHERE feature->>'tag' IS NOT NULL
    )
SELECT * FROM tags;
These relied on JSON arrays, which we didn’t explore in the section above. Sorry about that!
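If you do want a quick peek at them, an aggregation along these lines should surface the most-used hashtags (a usage sketch of the view above, not something we’ll return to):

SELECT tag, COUNT(*)
FROM tags
GROUP BY tag
ORDER BY count DESC
LIMIT 10;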
Timely information
We’re going to look at Bluesky data, but we might be more interested in recent data: which hashtags are popular in posts and replies now, rather than over the past day, or week, or year.
Materialize allows you to use time in WHERE filters, which automatically ages data out when the time passes the limit of the filter. You do this by using mz_now(), the function that gives you Materialize’s view of the current time, and some inequalities in a WHERE clause, like so:
CREATE VIEW recent_posts AS
SELECT *
FROM posts
WHERE mz_now() < ((body->>'time_us')::bigint/1000 + 300000);
CREATE VIEW recent_replies AS
SELECT *
FROM replies
WHERE mz_now() < ((body->>'time_us')::bigint/1000 + 300000);
These both now reflect all posts and replies within the past five minutes (time_us is in microseconds; dividing by 1,000 gives milliseconds, and 300,000 milliseconds is five minutes). Importantly, this is a live view over the past five minutes, rather than a snapshot emitted once every five minutes.
You can subscribe to both of these, and you’ll see they go up and down, and generally stabilize rather than grow without bound.
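Here’s a minimal sketch of one way to watch both counts at once; the posts and replies column names are my own:

SUBSCRIBE (
    SELECT
        (SELECT COUNT(*) FROM recent_posts) AS posts,
        (SELECT COUNT(*) FROM recent_replies) AS replies
);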
| count | count |
| ----- | ----- |
| 2097 | 2365 |
You might wonder: why all the complexity just to count recent things? Surely we could take time_us and round it down to a time bucket, and count the posts and replies. Something like:
-- Number of posts in each five-minute bucket.
SELECT ((body->>'time_us')::bigint/300000000) bucket, COUNT(*)
FROM posts
GROUP BY 1;

-- Number of replies in each five-minute bucket.
SELECT ((body->>'time_us')::bigint/300000000) bucket, COUNT(*)
FROM replies
GROUP BY 1;
This produces a continually evolving count of the posts and replies associated with each bucket. Each post or reply lands in exactly one bucket, and contributes to just one count. That works great for counting, and a few other statistics, but it falls down on one of the main properties of social data: that the events reference each other. Non-trivial social interactions span time, and time buckets, and we’ll need an approach other than bucketing to watch them evolve.
Let’s try and do something non-trivial with these posts and replies. Let’s reassemble conversations that happened, by stitching together each reply with the post (or reply) it is in response to.
Maintaining some Results
Before moving onward, we’re going to save ourselves some time and prompt Materialize to maintain some of the results we’ve defined.
We’ve defined quite a few views, but so far they are only views: shortcuts to explain what a query is if we issue it in the future. We’ll want to actively compute and maintain these results, to save ourselves the time of having Materialize reconsider all of the data with each query we perform.
We’re going to use two collections repeatedly, both of which are reductions of the input data: recent_replies is the subset of replies from the past five minutes, and urls reflects all of the data but only the cid and url.
CREATE DEFAULT INDEX ON recent_replies;
CREATE DEFAULT INDEX ON urls;
With these in hand, our interactive exploration will go much faster!
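If you want to confirm the indexes exist, this should list them:

SHOW INDEXES;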
Tracking Active Posts
To start with, we can parse out some information from replies. Here we’ll grab the cids (content identifiers, I believe) of the reply, its immediate parent, and the root of the conversation (present in the record).
-- Useful data for each reply.
CREATE VIEW replies_parsed AS
SELECT
    body->'commit'->>'cid' AS reply_cid,
    body->'commit'->'record'->'reply'->'parent'->>'cid' AS parent_cid,
    body->'commit'->'record'->'reply'->'root'->>'cid' AS root_cid,
    body->>'did' AS author
FROM recent_replies;
We can now look for the most active posts, as a function of the number of replies:
SELECT COUNT(*), root_cid
FROM replies_parsed
GROUP BY root_cid
ORDER BY count DESC
LIMIT 10;
which gives us
| count | root_cid |
| ----- | ----------------------------------------------------------- |
| 76 | bafyreifvguvy33rknoumvg4z77vaxjyux4abe4w7ijrjojennr2vln6pwy |
| 57 | bafyreih6xwxpvqov5bycpjwusqy57ebbloq243yw2ql4c3xst2vmfssczi |
| 30 | bafyreibjquxhr462ab6osy2744gzim4er6nttjr2wbnbvadelq6smhcuae |
| 26 | bafyreiexdchamc4eq2642noyjorubghwln7hhls5gqha6qbopg5sikn5sy |
| 25 | bafyreig5vpzkahmxs5dca5xf2nu6qmowmhcg46anxqjmtx254t676ouaxe |
| 25 | bafyreigp5j4zhbmp4h7qf5w5a24uxqqfl7g6g74dwlpunhkzrjg3vrhojy |
| 24 | bafyreihgun2h6xzu3glkyxtq5k5ycgtjtivhdfq3ql4ureuvayqhjj4uy4 |
| 23 | bafyreibryywro3pyfpvntlsaxr75obwzvwvdikqscgjowsxymmn62bsaqe |
| 22 | bafyreifbaxat6d67pqp5ye5kejmnbpvrysftxl2ppt543dbztecmc4efiy |
| 20 | bafyreigrdwm6cclffwvgvjfh5hxqxvefbey3eytbbr3unyxshhajumoxw4 |
That’s .. not super clear. Fortunately, we can join with urls to get clickable links.
SELECT COUNT(*), url
FROM replies_parsed, urls
WHERE root_cid = urls.cid
GROUP BY 2
ORDER BY count DESC
LIMIT 10;
These are now clickable links, continually updated, reporting the hottest posts.
| count | url |
| ----- | ---------------------------------------------------------------------------- |
| 76 | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcsbzj562a |
| 57 | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcptqe742u |
| 30 | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfd24mubp2a |
| 25 | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcycunl72q |
| 25 | https://bsky.app/profile/did:plc:h2ataqkfhejqhlkmmdknshxb/post/3ltfcxkr5lu2w |
| 17 | https://bsky.app/profile/did:plc:ln72v57ivz2g46uqf4xxqiuh/post/3ltfbxyfdo722 |
| 16 | https://bsky.app/profile/did:plc:2buz4gf5sew3rdwzbxsvcd4s/post/3ltfcvjkefc2g |
| 16 | https://bsky.app/profile/did:plc:af57yox3yialvrqahze5pfpe/post/3ltfch4swe225 |
| 14 | https://bsky.app/profile/did:plc:t4x2ruk2qmob2b2cx55h4v7r/post/3ltf25nctt22g |
| 13 | https://bsky.app/profile/did:plc:6vomxgshphbxhvru6q7bpist/post/3ltfcqwk67k27 |
You might notice that we don’t have all the same counts. Some of the replies are to posts not in our crawl, because I’ve restarted many things as part of the experiment.
But total replies may be more an indicator of activity than of interest. Let’s add a COUNT(DISTINCT author) aggregation, which additionally scores posts by the number of distinct participants, rather than their number of replies.
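The query is the one above with the extra aggregate (it also appears in the appendix):

SELECT COUNT(DISTINCT author) as participants, COUNT(*), url
FROM replies_parsed, urls
WHERE root_cid = urls.cid
GROUP BY 3
ORDER BY count DESC
LIMIT 10;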
| participants | count | url |
| ------------ | ----- | ---------------------------------------------------------------------------- |
| 74 | 76 | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcsbzj562a |
| 53 | 57 | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcptqe742u |
| 29 | 30 | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfd24mubp2a |
| 1 | 25 | https://bsky.app/profile/did:plc:h2ataqkfhejqhlkmmdknshxb/post/3ltfcxkr5lu2w |
| 25 | 25 | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcycunl72q |
| 16 | 17 | https://bsky.app/profile/did:plc:ln72v57ivz2g46uqf4xxqiuh/post/3ltfbxyfdo722 |
| 8 | 16 | https://bsky.app/profile/did:plc:af57yox3yialvrqahze5pfpe/post/3ltfch4swe225 |
| 16 | 16 | https://bsky.app/profile/did:plc:2buz4gf5sew3rdwzbxsvcd4s/post/3ltfcvjkefc2g |
| 14 | 14 | https://bsky.app/profile/did:plc:t4x2ruk2qmob2b2cx55h4v7r/post/3ltf25nctt22g |
| 5 | 13 | https://bsky.app/profile/did:plc:6vomxgshphbxhvru6q7bpist/post/3ltfcqwk67k27 |
We are still ordering by count, and you can see that there are a few distinct flavors here. Some posts have as many participants as responses (one-off comments), and some posts have only one participant (more of a read-out). One is clearly a low number of participants, but what about 25, or 8? Are these potentially exciting conversations, or run of the mill?
Monitoring Distributions (of Participants)
How many distinct participants do we expect to see in response to each post?
Let’s look at the distribution over the number of distinct participants, to see what levels of interaction occur across all of Bluesky.
SELECT participants, COUNT(*)
FROM (
    SELECT COUNT(DISTINCT author) AS participants
    FROM replies_parsed
    GROUP BY root_cid
)
GROUP BY participants
ORDER BY count DESC;
| participants | count |
| ------------ | ----- |
| 1 | 4889 |
| 2 | 718 |
| 3 | 158 |
| 4 | 58 |
| 5 | 30 |
| 6 | 16 |
| 7 | 11 |
| 8 | 7 |
| 9 | 7 |
| 10 | 5 |
| 13 | 3 |
| 15 | 2 |
| 16 | 2 |
| 22 | 2 |
| 11 | 1 |
| 14 | 1 |
| 18 | 1 |
| 25 | 1 |
| 29 | 1 |
| 53 | 1 |
| 74 | 1 |
Even 8 distinct participants is well into the tail of the distribution, and a notable level of engagement. Engagement with the root post itself, at least, although replies themselves do not necessarily a conversation make. The highest-scoring posts here are most politely described as “rage bait”, and the single-reply authors suggest not much interaction happened.
Let’s dive deeper into a search for authentic communication.
Looking for Interactions
Posts with lots of responses may be easy, because folks just blurt out their takes. What does real interaction look like?
Let’s look for paths from replies up towards their root posts where authorship changes, scoring each path by the number of author alternations.
CREATE VIEW conversations AS
WITH MUTUALLY RECURSIVE
    threads (
        reply_cid text,
        author text,
        responses int4
    ) AS (
        SELECT
            reply_cid,
            author,
            MAX(responses)
        FROM (
            SELECT reply_cid, author, 0 AS responses
            FROM replies_parsed
            UNION ALL
            SELECT
                rp.reply_cid,
                rp.author,
                threads.responses + CASE WHEN threads.author != rp.author THEN 1 ELSE 0 END
            FROM threads, replies_parsed rp
            WHERE threads.reply_cid = rp.parent_cid
        )
        GROUP BY 1, 2
    )
SELECT reply_cid, responses
FROM threads;
As before, we’ll want to stitch these together with their URLs to make sense of them.
-- Stitch together conversations with their URLs.
CREATE VIEW linked_convos AS
SELECT responses, url
FROM conversations c, urls
WHERE urls.cid = c.reply_cid;
Let’s pull out the top ten, ranked by the number of author alternations.
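This is the same query that appears in the appendix at the end of the post:

SELECT *
FROM linked_convos
ORDER BY responses DESC
LIMIT 10;

For me, that produced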
| responses | url |
| --------- | ---------------------------------------------------------------------------- |
| 6 | https://bsky.app/profile/did:plc:4lmrw65t5zpy7iiqtot4lgsg/post/3ltfd5jgku22x |
| 5 | https://bsky.app/profile/did:plc:2ubpspaf7ujjdr2e5xprhrzu/post/3ltfd465wb22f |
| 5 | https://bsky.app/profile/did:plc:gkmyaujvjpvowpumofx4aukj/post/3ltfd55nu622d |
| 4 | https://bsky.app/profile/did:plc:fi2keywfodbhnsgbvfwliwep/post/3ltfd36igmk2e |
| 4 | https://bsky.app/profile/did:plc:hiuucknp3uehfnnujnxlvpzt/post/3ltfd3mdeh22i |
| 4 | https://bsky.app/profile/did:plc:qmhqmqfnrtmjsnkfjblfuxre/post/3ltfd2oqeqc2w |
| 4 | https://bsky.app/profile/did:plc:retp7mhsewmszkqxom2n3kpl/post/3ltfd4mzxns2q |
| 3 | https://bsky.app/profile/did:plc:3ths7aqkejjiogfvg34ubg4h/post/3ltfd3upof22p |
| 3 | https://bsky.app/profile/did:plc:gkmyaujvjpvowpumofx4aukj/post/3ltfczslepc2d |
| 3 | https://bsky.app/profile/did:plc:zeanejbbygnlc74253s5gyok/post/3ltfd45pjd22p |
These look more like real communication. The sort of real communication you might expect when you have six author alternations in five minutes. Many of them are to posts in the same thread, and we could deduplicate by the root post, or something similar.
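Here’s a sketch of that deduplication; it’s my own query rather than part of the session above, and it assumes that joining conversations back to replies_parsed on reply_cid is the right way to recover each reply’s root:

-- One row per root post: the best-scoring reply chain beneath it.
SELECT DISTINCT ON (rp.root_cid)
    rp.root_cid,
    c.responses,
    u.url
FROM conversations c, replies_parsed rp, urls u
WHERE c.reply_cid = rp.reply_cid
  AND u.cid = rp.root_cid
ORDER BY rp.root_cid, c.responses DESC;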
Looking Further Afield
We’ve studied individual posts, but Bluesky (and “social” generally) is not just siloed posts. For each post, we can look for similar posts as judged by their interactions. For example, for each post, which other post has the largest number of respondents in common?
CREATE VIEW recommended AS
SELECT DISTINCT ON (r1) r1, r2, shared
FROM (
    SELECT
        r1.root_cid r1,
        r2.root_cid r2,
        COUNT(DISTINCT r1.author) AS shared
    FROM replies_parsed r1, replies_parsed r2
    WHERE r1.author = r2.author
      AND r1.root_cid != r2.root_cid
      -- Restrict our attention to recommendations we've crawled.
      AND r2.root_cid IN (SELECT cid FROM urls)
    GROUP BY 1, 2
)
ORDER BY r1, shared DESC, r2;
This provides a “recommendation” from each post to some other post, based not on its content but the pattern of interaction with it. The same people who interacted with this post also interact with that post. We could even look at timestamps to see if they then go on to interact with the second post. Where do they ultimately end up?
For the moment, let’s just add some URLs to the recommendations so that we can check them out.
SELECT u1.url AS this, u2.url AS that, shared
FROM recommended, urls u1, urls u2
WHERE r1 = u1.cid
AND r2 = u2.cid
ORDER BY shared DESC
LIMIT 10;
The recommendations from reading this to next read that are as follows (sorry for the wall of text):
| this | that | shared |
| ---------------------------------------------------------------------------- | ---------------------------------------------------------------------------- | ------ |
| https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcptqe742u | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcsbzj562a | 8 |
| https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcsbzj562a | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcptqe742u | 8 |
| https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfd24mubp2a | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcsbzj562a | 3 |
| https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcycunl72q | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfd24mubp2a | 2 |
| https://bsky.app/profile/did:plc:euz3fvldom5fvdf4vp7s5bak/post/3ltfcmb2stc27 | https://bsky.app/profile/did:plc:i52jkuv7iobxqvdabckhq6ui/post/3ltfclaqj6c2l | 2 |
| https://bsky.app/profile/did:plc:i52jkuv7iobxqvdabckhq6ui/post/3ltfclaqj6c2l | https://bsky.app/profile/did:plc:euz3fvldom5fvdf4vp7s5bak/post/3ltfcmb2stc27 | 2 |
| https://bsky.app/profile/did:plc:kdphmaoqeaouax2kis27556e/post/3ltfcbd3ros26 | https://bsky.app/profile/did:plc:4llrhdclvdlmmynkwsmg5tdc/post/3ltfcsbzj562a | 2 |
| https://bsky.app/profile/did:plc:2bfdxrptilof6qhw57koblcc/post/3ltfcqacbws2j | https://bsky.app/profile/did:plc:qcgbjlxqdwv2mcolldoljvbx/post/3ltfcr2ouck2e | 1 |
| https://bsky.app/profile/did:plc:2khndv2hbyuzwt66zona4gk5/post/3ltfcxij24k2u | https://bsky.app/profile/did:plc:aezl4prncwzcxn2daogn4ezt/post/3ltfcva4k6s2p | 1 |
| https://bsky.app/profile/did:plc:2ullnxyoicn2tzx2paseoflf/post/3ltfccjf7kk2i | https://bsky.app/profile/did:plc:5o6k7jvowuyaquloafzn3cfw/post/3ltfcdgndmc22 | 1 |
The high count recommendations are mostly back to the high volume posts, which doesn’t necessarily make them high quality recommendations. The next steps have been the study of Information Retrieval since time immemorial (any time preceding the advent of information retrieval): precision, recall, mean average precision, the Dice-Sørensen coefficient. Ideally, your careful thought can be more easily translated into action at this point, where you just need to type the SQL, and the results will be kept live and up to date.
We didn’t even crack open the likes and reposts, which are further (if quieter) signals of interest.
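If you wanted to start on likes, they should already be flowing in (the wantedCollections filter above matches them). A view along these lines is probably the shape of it, assuming (from the AT Protocol lexicon, not anything verified above) that like records live in the app.bsky.feed.like collection and point at the liked post via record->subject->cid:

-- Hypothetical sketch: one row per like, with the liked post's cid and the liker.
CREATE VIEW likes AS
SELECT
    body->'commit'->'record'->'subject'->>'cid' AS liked_cid,
    body->>'did' AS author
FROM bsky
WHERE body->'commit'->>'collection' = 'app.bsky.feed.like';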
Wrapping up
The story told above is against the backdrop of continually changing data, but at one moment in time. Although the story evolves over time, all of the queries were executed in one Materialize transaction, and all their results tie out exactly. Constrained to the format of static text, that was the coolest trick I could pull off.
To get the real experience yourself, rather than a point-in-time view over some data, you should absolutely go and grab a copy of Materialize yourself. Our self-managed edition lets you try the real bits, and all of the analysis above fits within the bounds of the Community Edition.
Appendix
All of the results above are entirely consistent with each other, despite continually changing as their inputs change, because I performed them in one Materialize transaction. If you would like to do the same, having already defined the necessary views, you can copy/paste this SQL:
BEGIN;
SELECT COUNT(*), root_cid
FROM replies_parsed
GROUP BY root_cid
ORDER BY count DESC
LIMIT 10;
SELECT COUNT(*), url
FROM replies_parsed, urls
WHERE root_cid = urls.cid
GROUP BY 2
ORDER BY count DESC
LIMIT 10;
SELECT COUNT(DISTINCT author) as participants, COUNT(*), url
FROM replies_parsed, urls
WHERE root_cid = urls.cid
GROUP BY 3
ORDER BY count DESC
LIMIT 10;
SELECT participants, COUNT(*)
FROM (
    SELECT COUNT(DISTINCT author) AS participants
    FROM replies_parsed
    GROUP BY root_cid
)
GROUP BY participants
ORDER BY count DESC;
SELECT *
FROM linked_convos
ORDER BY responses DESC
LIMIT 10;
SELECT u1.url AS this, u2.url AS that, shared
FROM recommended, urls u1, urls u2
WHERE r1 = u1.cid
AND r2 = u2.cid
ORDER BY shared DESC
LIMIT 10;
COMMIT;