Everyone knows that if you want to thread emails, you use Jamie Zawinski's algorithm. But it's a new century, and there's a new messaging service.
What's the best algorithm for threading status updates posted on Twitter?
Things I'd definitely like it to cope with:
The easy part: using in_reply_to_status_id, in_reply_to_user_id and in_reply_to_screen_name. (Incidentally, finding proper documentation of these values would be useful in itself! Such documentation isn't obviously linked to from here, for example.)
Good heuristics for inferring a "reply" relationship from messages that mention a user with the @ convention but aren't explicitly in reply to a particular message. These "mentions" are now provided in the "entities" element of statuses if you request it. These heuristics might take into account (a) the time between two status updates, (b) whether there are subsequent replies between the two users, etc. (Replies that consist of an old-style retweet with an additional comment, as mentioned by user85509 below, are just an instance of this style of reply.)
Conversations that take place between more than two users.
Working with a set of tweets given to the algorithm, or all tweets on Twitter.
... but perhaps you can think of more.
Since there's only been one answer, and the bounty deadline is approaching soon, I thought I should add a baseline answer so the bounty isn't automatically awarded to an answer that doesn't add much beyond what's in the question.
The obvious first step is to take your original set of tweets and follow all in_reply_to_status_id links to build many directed acyclic graphs. These relationships you can be nearly 100% sure about. (You should follow the links even through tweets that aren't in the original set, adding those to the set of status updates that you're considering.)
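As a rough illustration of that first step, here is a minimal sketch in Python. It assumes each status update is a plain dict with an "id" and an optional "in_reply_to_status_id", and that `fetch_status` is a hypothetical callable (not a real Twitter client call) that looks up a status outside the original set and returns None if it can't be found.

```python
from collections import defaultdict


def build_reply_forest(tweets, fetch_status=None):
    """Link tweets into threads via in_reply_to_status_id.

    tweets: iterable of dicts with "id" and optional "in_reply_to_status_id".
    fetch_status: hypothetical callable (status id -> tweet dict or None)
        used to pull in parents that are not in the original set.
    Returns (by_id, children, roots).
    """
    by_id = {t["id"]: t for t in tweets}

    # Follow links through tweets outside the original set, adding them.
    pending = list(by_id.values())
    while pending:
        tweet = pending.pop()
        parent_id = tweet.get("in_reply_to_status_id")
        if parent_id is None or parent_id in by_id:
            continue
        parent = fetch_status(parent_id) if fetch_status else None
        if parent is not None:
            by_id[parent_id] = parent
            pending.append(parent)

    # Build parent -> children edges; tweets with no known parent are roots.
    children = defaultdict(list)
    roots = []
    for tweet in by_id.values():
        parent_id = tweet.get("in_reply_to_status_id")
        if parent_id in by_id:
            children[parent_id].append(tweet["id"])
        else:
            roots.append(tweet["id"])

    children = {parent: sorted(kids) for parent, kids in children.items()}
    return by_id, children, sorted(roots)
```

A tweet whose parent can't be fetched (deleted or protected, say) simply becomes the root of its own thread in this sketch, so such threads may be split.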
Beyond that easy step, one has to deal with the "mentions". Unlike in email threading, there's nothing helpful like a subject line that one can match on - this is inevitably going to be very error prone. The approach I would take is to create a feature vector for every possible relationship between status IDs that might be represented by mentions in that tweet, and then train a classifier to guess the best option, including a "no reply" option.
To work out the "every possible relationship" bit, start by considering every status update that mentions one or more other users and doesn't contain an in_reply_to_status_id. Suppose an example of one of these tweets is: [1]
@a @b no it isn't lol RT @c Yes, absolutely. /cc @stephenfry
... you would create a feature vector for the relationship between this update and every update with an earlier date in the timelines of @a, @b, @c, and @stephenfry for the last week (say), and one between that update and a special "no reply" update. Then you have to decide what goes into each feature vector - you can add to this whatever you would like, but I would at least suggest adding:
following / followed ratio for the author of the original update.
The more of these one can come up with the better, since the classifier will only use those that turn out to be useful. I'd suggest trying a random forest classifier, which is conveniently implemented in Weka.
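To make the candidate enumeration and feature vectors a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: tweets are plain dicts, `timelines` is a hypothetical mapping from screen name to that user's recent updates, the key names "author_following" and "author_followers" are made up, and mentions are pulled out with a crude regex rather than the "entities" element. It also reads "author of the original update" as the author of the candidate parent, and adds a time-gap feature taken from the heuristics in the question.

```python
import re
from datetime import datetime, timedelta

MENTION_RE = re.compile(r"@(\w+)")  # crude stand-in for the "entities" data
NO_REPLY = None  # sentinel candidate meaning "not a reply to anything"


def mentioned_users(tweet):
    """Screen names mentioned in the tweet text, lower-cased, in order."""
    return [name.lower() for name in MENTION_RE.findall(tweet["text"])]


def candidate_parents(tweet, timelines, window=timedelta(days=7)):
    """Earlier updates by mentioned users within the window, plus NO_REPLY."""
    candidates = [NO_REPLY]
    for name in dict.fromkeys(mentioned_users(tweet)):  # de-duplicate names
        for other in timelines.get(name, []):
            age = tweet["created_at"] - other["created_at"]
            if timedelta(0) < age <= window:
                candidates.append(other)
    return candidates


def features(tweet, candidate):
    """Feature dict for the (tweet, candidate parent) relationship."""
    if candidate is NO_REPLY:
        return {"is_no_reply": 1.0, "hours_between": 0.0, "follow_ratio": 0.0}
    gap = tweet["created_at"] - candidate["created_at"]
    return {
        "is_no_reply": 0.0,
        "hours_between": gap.total_seconds() / 3600.0,
        # following / followed ratio for the candidate's author;
        # max(..., 1) avoids dividing by zero for accounts with no followers.
        "follow_ratio": candidate["author_following"]
        / max(candidate["author_followers"], 1),
    }
```

Each tweet then yields one feature row per candidate, and the classifier's job is to pick the most likely row (possibly the "no reply" one). These dicts would still need converting into whatever input format the chosen classifier expects.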
Next, one needs a training set. This can be small at first - just enough to get a service that identifies conversations up and running. To this basic service, one would have to add a nice interface that lets users correct mismatched or falsely linked updates. Using this data one can build a bigger training set and a more accurate classifier.
[1] ... which might be typical of the level of discourse on Twitter ;)