
Alternatives to traditional relational databases for activity streams

I'm wondering whether a non-relational database would be a good fit for activity streams, like what you see on Facebook, Flickr (http://www.flickr.com/activity), etc. Right now I'm using MySQL, but it's pretty taxing (I have tens of millions of activity records). Since the records are basically read-only once written and are always viewed chronologically, I was thinking that an alternative DB might work well.

The activities are things like:

  • 6 PM: John favorited Bacon
  • 5:30 PM: Jane commented on Snow Crash
  • 5:15 PM: Jane added a photo of Bacon to her album

The catch is that, unlike Twitter and some other systems, I can't simply append activities to a list for each user who is interested in the activity. If I could, Redis looks like it would be a good fit (with its list operations).

I need to be able to do the following:

  • Pull activities for a set or subset of people who you are following ("John" and "Jane"), in reverse date order
  • Pull activities for a thing (like "Bacon") in reverse date order
  • Filter by activity type ("favorite", "comment")
  • Store at least 30 million activities
  • Ideally, if you added or removed a person who you are following, your activity stream would reflect the change.

I have been doing this with MySQL. My "activities" table is as compact as I could make it, the keys are as small as possible, and it is indexed appropriately. It works, but it just feels like the wrong tool for this job.

Is anybody doing anything like this outside of a traditional RDBMS?

Update November 2009: It's too early to answer my own question, but my current solution is to stick with MySQL but augment with Redis for fast access to the fresh activity stream data. More information in my answer here: How to implement the activity stream in a social network...

Update August 2014: Years later, I'm still using MySQL as the system of record and using Redis for very fast access to the most recent activities for each user. Dealing with schema changes on a massive MySQL table has become a non-issue thanks to pt-online-schema-change.
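A minimal sketch of what such a "recent activities" cache could look like, not necessarily how the setup described above is actually built. It assumes one capped list per actor (hypothetical key name `recent:<actor>`), filled with the LPUSH + LTRIM pattern and merged at read time, so that following or unfollowing someone changes the stream immediately. `InMemoryListStore` is a stand-in written for this example so it runs without a server; a real Redis client exposing `lpush`, `ltrim` and `lrange` (as redis-py does) could be passed in instead. `RECENT_LIMIT`, `record_activity` and `recent_feed` are likewise illustrative names.

```python
import heapq
import itertools
import json

RECENT_LIMIT = 100  # assumption: how many recent activities to keep per actor


class InMemoryListStore:
    """Stand-in mimicking Redis LPUSH/LTRIM/LRANGE (non-negative indices only)."""

    def __init__(self):
        self._lists = {}

    def lpush(self, key, value):
        self._lists.setdefault(key, []).insert(0, value)
        return len(self._lists[key])

    def ltrim(self, key, start, stop):
        # Redis treats `stop` as inclusive, hence the +1.
        self._lists[key] = self._lists.get(key, [])[start:stop + 1]

    def lrange(self, key, start, stop):
        return self._lists.get(key, [])[start:stop + 1]


def record_activity(store, actor, ts, text):
    """Push the newest activity to the head of the actor's list and cap its length.

    Assumes activities are recorded in chronological order, so each list
    stays sorted newest-first.
    """
    key = f"recent:{actor}"
    store.lpush(key, json.dumps({"ts": ts, "actor": actor, "text": text}))
    store.ltrim(key, 0, RECENT_LIMIT - 1)


def recent_feed(store, followed, count=10):
    """Merge the newest-first lists of the followed actors at read time."""
    per_actor = [
        [json.loads(raw) for raw in store.lrange(f"recent:{a}", 0, RECENT_LIMIT - 1)]
        for a in followed
    ]
    merged = heapq.merge(*per_actor, key=lambda act: act["ts"], reverse=True)
    return list(itertools.islice(merged, count))


store = InMemoryListStore()
record_activity(store, "Jane", 1715, "Jane added a photo of Bacon to her album")
record_activity(store, "Jane", 1730, "Jane commented on Snow Crash")
record_activity(store, "John", 1800, "John favorited Bacon")

for activity in recent_feed(store, ["John", "Jane"]):
    print(activity["ts"], activity["text"])
```

Anything older than the cap would still have to come from MySQL, which stays the system of record in this arrangement.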

asked Aug 27 '09 17:08 by outcassed



2 Answers

I'd really, really suggest staying with MySQL (or another RDBMS) until you fully understand the situation.

I have no idea how much performance you need or how much data you plan to store, but 30M rows is not very many.

If you need to optimise certain range scans, you can do this with (for example) InnoDB by choosing the (implicitly clustered) primary key judiciously, and/or by denormalising where necessary.
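As a rough illustration of the primary-key idea, here is a sketch that uses SQLite (via Python's `sqlite3`) as a stand-in for InnoDB: a `WITHOUT ROWID` table is stored in primary-key order, which is the closest SQLite analogue to a clustered primary key. The table layout, column names, timestamps and the `feed` / `about_object` helpers are all assumptions made up for this example, not the asker's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Rows are stored ordered by (actor_id, created_at, activity_id), so one
# actor's activities sit next to each other in time order.
conn.execute("""
    CREATE TABLE activities (
        actor_id    INTEGER NOT NULL,
        created_at  INTEGER NOT NULL,  -- illustrative timestamp
        activity_id INTEGER NOT NULL,  -- tie-breaker for equal timestamps
        verb        TEXT    NOT NULL,
        object_id   INTEGER NOT NULL,
        PRIMARY KEY (actor_id, created_at, activity_id)
    ) WITHOUT ROWID
""")
# Secondary index for "activities about a thing, newest first".
conn.execute(
    "CREATE INDEX activities_by_object ON activities (object_id, created_at)"
)

# Hypothetical ids: John = 1, Jane = 2, Bacon = 10, Snow Crash = 11.
conn.executemany(
    "INSERT INTO activities VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1800, 1, "favorite", 10),
        (2, 1730, 2, "comment", 11),
        (2, 1715, 3, "photo", 10),
    ],
)


def feed(conn, followed_ids, verbs=None, limit=20):
    """Activities by the followed actors, newest first, optionally filtered by verb."""
    if not followed_ids:
        return []
    sql = (
        "SELECT actor_id, verb, object_id, created_at FROM activities "
        f"WHERE actor_id IN ({','.join('?' * len(followed_ids))})"
    )
    params = list(followed_ids)
    if verbs:
        sql += f" AND verb IN ({','.join('?' * len(verbs))})"
        params += list(verbs)
    sql += " ORDER BY created_at DESC, activity_id DESC LIMIT ?"
    params.append(limit)
    return conn.execute(sql, params).fetchall()


def about_object(conn, object_id, limit=20):
    """Activities about one thing, newest first, served by the secondary index."""
    return conn.execute(
        "SELECT actor_id, verb, object_id, created_at FROM activities "
        "WHERE object_id = ? ORDER BY created_at DESC LIMIT ?",
        (object_id, limit),
    ).fetchall()


print(feed(conn, [1, 2]))
print(feed(conn, [2], verbs=["comment"]))
print(about_object(conn, 10))
```

Because the stream is computed at read time from the list of followed ids, following or unfollowing someone is reflected on the next query without rewriting any stored data.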

But like most things, make it work first, then fix performance problems you detect in your performance test lab on production-grade hardware.


EDIT: Some other points:

  • Key/value databases such as Cassandra, Voldemort, etc. generally do not support secondary indexes
  • Therefore, you cannot do a CREATE INDEX
  • Most of them also don't do range scans (even on the main index), because most of them partition data by hashing the key
  • Therefore they also don't do range expiry (DELETE FROM tbl WHERE ts < NOW() - INTERVAL 30 DAY)
  • Your application must do ALL of this itself or manage without it; secondary indexes are really the killer
  • ALTER TABLE ... ADD INDEX takes quite a long time on a large table in e.g. MySQL, but at least you don't have to write much code to do it. In a "nosql" database it will also take a long time, BUT you also have to write heaps and heaps of code to maintain the new secondary index, expire it correctly, AND modify your queries to use it.
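To give a feel for the code the points above are talking about, here is a toy sketch in which plain Python dicts stand in for a key/value store. The `ActivityStore` class and its method names are invented for this example and do not correspond to any real database's API; the point is only that the index maintenance, the "range scan" and the expiry all become application code.

```python
import bisect


class ActivityStore:
    """Toy key/value store with a hand-maintained secondary index."""

    def __init__(self):
        self.primary = {}    # activity_id -> record (the "real" key/value data)
        self.by_object = {}  # object -> sorted list of (ts, activity_id)

    def put(self, activity_id, ts, actor, verb, obj):
        # Activities are write-once here, so there is no update path that
        # would have to clean up stale index entries.
        self.primary[activity_id] = {
            "ts": ts, "actor": actor, "verb": verb, "object": obj,
        }
        # The application keeps the index sorted on every write.
        bisect.insort(self.by_object.setdefault(obj, []), (ts, activity_id))

    def for_object(self, obj, limit=20):
        """Newest-first activities for one object, answered from the index."""
        entries = self.by_object.get(obj, [])
        return [self.primary[aid] for _, aid in entries[::-1][:limit]]

    def expire_before(self, cutoff_ts):
        """No range delete: scan every key, then fix up the index by hand."""
        expired = [aid for aid, rec in self.primary.items() if rec["ts"] < cutoff_ts]
        for aid in expired:
            rec = self.primary.pop(aid)
            self.by_object[rec["object"]].remove((rec["ts"], aid))
        return len(expired)


kv = ActivityStore()
kv.put(1, 1800, "John", "favorite", "Bacon")
kv.put(2, 1730, "Jane", "comment", "Snow Crash")
kv.put(3, 1715, "Jane", "photo", "Bacon")

print([rec["actor"] for rec in kv.for_object("Bacon")])
print(kv.expire_before(1720))
print([rec["actor"] for rec in kv.for_object("Bacon")])
```

Every further access pattern (by actor, by verb, by follower set) would need another structure like `by_object`, kept consistent on every write and every expiry.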

In short... you can't use a key/value database as a shortcut to avoid ALTER TABLE.

answered Sep 27 '22 21:09 by MarkR



I am also planning on moving away from SQL. I have been looking at CouchDB, which looks promising. Looking at your requirements, I think they can all be met with CouchDB views and the list API.

answered Sep 27 '22 20:09 by Zed
