
DynamoDB/Redis activity stream help needed

I have chosen DynamoDB as the backend for my activity feed/events data but am having some trouble deciding on the best data structure to use.

Firstly, I should explain that activity IDs for each user are stored in Redis: in sorted sets for personal profile activities, and in lists for an individual's activity stream. This means that any activity tables I have in DynamoDB will only need a hash key, with no need for range keys or local/global secondary indexes, since the activities are essentially being indexed in Redis.

We are doing this so that we can effectively aggregate feed and profile activity data by manipulating the ID lists and sets in Redis.

Anyway... our initial plan was to create a DynamoDB table for each month and store the activity data there, then dial down the provisioned throughput for older tables as they age, keeping the most recent data fast and available while keeping the cost down for old data (a sketch of the dial-down step follows).
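For illustration, a minimal boto3 sketch of that dial-down step; the table name and capacity numbers are placeholders, not our real values:

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Lower the provisioned throughput on an aged monthly activity table.
    # Table name and capacity units are hypothetical.
    def dial_down(table_name, read_units=5, write_units=1):
        dynamodb.update_table(
            TableName=table_name,
            ProvisionedThroughput={
                "ReadCapacityUnits": read_units,
                "WriteCapacityUnits": write_units,
            },
        )

    dial_down("activities_2013_01")  # an old month that is rarely read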

While this technique works very well for the activity stream itself, it does not work when viewing a user's profile (and their own historic activities). Since, in a manner similar to Facebook's timeline, users are able to view activities all the way back to their birth and can add custom life events to their profile, this requirement would mean having a table for each month of the last 80 years or so. Therefore, we need something else.

Currently we are toying with the idea of splitting the activity tables into activity types, e.g.:

activities_comments
activities_likes
activities_uploads
activities_posts

... And so on.

We would need around 20 tables to cover all our current activity types. Using this method would allow us to selectively provision throughput for the most commonly occurring activity types, which to us seems preferable to keeping a single activity table with a huge and expensive provisioned throughput.

In Redis, we would simply add a table suffix to each activity ID so that we know which table the activity metadata is stored in (a minimal sketch of the scheme is below). We would then be able to query the data as follows:
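For example, assuming a colon delimiter and hex IDs (both assumptions for illustration):

    # An ID like "8f3c2a1e:comments" tells us the metadata lives in
    # the activities_comments table.
    def encode_activity_id(raw_id, activity_type):
        return "%s:%s" % (raw_id, activity_type)

    def decode_activity_id(activity_id):
        raw_id, activity_type = activity_id.rsplit(":", 1)
        return raw_id, "activities_" + activity_type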

For activity streams:

  • Activity IDs for each user's stream are stored in a Redis list (containing, after aggregation, activity data from everyone they follow)
  • Keep the list truncated to, say, 500 items to keep Redis memory requirements down
  • Simply query using Redis LRANGE to get the 20 most recent activities
  • Use DynamoDB BatchGetItem to pull those IDs out of the various tables... rinse and repeat as users scroll down their stream (see the sketch after this list)
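A rough sketch of that read path, assuming redis-py and boto3, a hash key named activity_id, and the decode helper from the suffix-scheme sketch above (all names hypothetical):

    import boto3
    import redis
    from collections import defaultdict

    r = redis.Redis()
    dynamodb = boto3.client("dynamodb")

    def fetch_stream_page(user_id, offset=0, page_size=20):
        # Most recent IDs first, assuming LPUSH on write.
        ids = r.lrange("stream:%s" % user_id, offset, offset + page_size - 1)

        # Group the IDs by their table suffix, then batch-get per table.
        keys_by_table = defaultdict(list)
        for activity_id in ids:
            raw_id, table = decode_activity_id(activity_id.decode())
            keys_by_table[table].append({"activity_id": {"S": raw_id}})

        response = dynamodb.batch_get_item(
            RequestItems={t: {"Keys": k} for t, k in keys_by_table.items()}
        )
        # A production version would also retry response["UnprocessedKeys"].
        return response["Responses"]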

For user profiles:

  • Aggregated activity IDs are stored in a Redis sorted set for each user, with the timestamp as the score
  • Use Redis ZRANGEBYSCORE to get a specific month or time range of activity IDs from the sorted set (i.e. a user can quickly pull their activity history for July 2012 should they wish)
  • Again, use BatchGetItem to retrieve the data from DynamoDB (sketched below)
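A corresponding sketch for pulling one month of profile history; the key names and the use of UTC Unix timestamps as scores are assumptions:

    import calendar
    from datetime import datetime

    import redis

    r = redis.Redis()

    def fetch_profile_month(user_id, year, month):
        # Unix-timestamp scores bounding the requested month.
        start = calendar.timegm(datetime(year, month, 1).timetuple())
        ny, nm = (year + 1, 1) if month == 12 else (year, month + 1)
        end = calendar.timegm(datetime(ny, nm, 1).timetuple()) - 1

        # One sorted set per user, scored by activity timestamp.
        ids = r.zrangebyscore("profile:%s" % user_id, start, end)
        # ...then group by table suffix and BatchGetItem, exactly as for streams.
        return ids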

The aggregation of data will be done offline: we will analyze the Redis lists/sorted sets for similar activities occurring in a given time period, create a new activity with the aggregated metadata, add it to DynamoDB, add the new activity to Redis in the correct place, and finally remove all the old related activities from the Redis lists/sets.

e.g.

  • 260 likes of the same photo are found, all within one week.
  • We build a SINGLE new activity with metadata reflecting this, containing a list of the old activity IDs (in case we ever need to retrieve them).
  • Remove the 260 activity IDs from the Redis lists/sets and replace them with the single new activity ID (a sketch follows).
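A sketch of that swap for the photo-likes case; the attribute and key names are assumptions, and the real job also applies our weighting logic:

    import json
    import uuid

    # Reuses r and dynamodb from the earlier sketches, plus
    # encode_activity_id() from the suffix-scheme sketch.
    def aggregate_likes(user_id, photo_id, like_ids, week_start):
        # Build ONE replacement activity whose metadata records the originals.
        agg_raw_id = uuid.uuid4().hex
        dynamodb.put_item(
            TableName="activities_likes",
            Item={
                "activity_id": {"S": agg_raw_id},
                "metadata": {"S": json.dumps({
                    "type": "aggregated_likes",
                    "photo_id": photo_id,
                    "count": len(like_ids),
                    "source_ids": like_ids,  # kept in case we need them later
                })},
            },
        )
        agg_id = encode_activity_id(agg_raw_id, "likes")

        # Atomically swap the old IDs for the new one in the user's sorted set.
        pipe = r.pipeline()
        for old_id in like_ids:
            pipe.zrem("profile:%s" % user_id, old_id)
        pipe.zadd("profile:%s" % user_id, {agg_id: week_start})
        pipe.execute()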

The above is actually substantially more complicated and takes into account the most-popular-post and activity-weighting logic we have developed... but it gives you a rough idea.

So, now that I've described the solution that we are currently thinking of going with, what I would like to know is:

  1. Does this sound like a good/fast/flexible/scalable solution?
  2. Are there any alternative data structures which might be better than what I have described?
  3. Are there any glaring issues with the above scenario that we might not have thought of?

I know this is kind of a vague question and that there's a lot to read, but any opinions or comments would be greatly appreciated.

NOTE: For the sake of completeness I should state that activity IDs are pushed out on write into a user's followers' activity streams in Redis (a minimal sketch follows). Though we are not averse to switching to fan-out on read, should someone convince us of its benefits in their answer.
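Roughly, the fan-out on write looks like this; the followers set is an assumption about how the social graph is stored:

    import redis

    r = redis.Redis()

    def fan_out(author_id, activity_id, max_stream_len=500):
        # Push the new activity ID onto every follower's stream list,
        # truncating so no list ever exceeds the cap.
        followers = r.smembers("followers:%s" % author_id)
        pipe = r.pipeline()
        for follower_id in followers:
            key = "stream:%s" % follower_id.decode()
            pipe.lpush(key, activity_id)
            pipe.ltrim(key, 0, max_stream_len - 1)
        pipe.execute()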

gordyr asked Dec 12 '13

2 Answers

Building activity feeds and newsfeeds on DynamoDB requires a lot of additional infrastructure due to how you propagate data (fan-out on write), which usually results in a lot of provisioning drama and high costs.

I wrote an article describing the challenges with running newsfeeds on DynamoDB here.

Disclaimer: I am the CTO and one of the co-founders of Stream.

Tommaso Barbugli answered Oct 14 '22


You could enable DynamoDB Streams on your activity tables and attach Lambda functions to them to incrementally aggregate activities into your Redis structures. Using time-series tables is a recommended practice for managing the cost of provisioned throughput on hot/cold data. However, there are practical limitations, like the per-account, per-region limit of 256 tables, that may limit your ability to keep all of the data in DynamoDB. The same Lambda function could also maintain sliding-window caches of activity counts that you could use to roll many small activities up into a single aggregate activity.
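A minimal sketch of such a Lambda handler; the item attribute names and Redis keys are assumptions, not part of the question:

    import os

    import redis

    r = redis.Redis(host=os.environ["REDIS_HOST"])

    def handler(event, context):
        # Triggered by a DynamoDB Stream on an activity table; mirrors
        # newly inserted activities into the per-user Redis sorted set.
        for record in event["Records"]:
            if record["eventName"] != "INSERT":
                continue
            item = record["dynamodb"]["NewImage"]
            activity_id = item["activity_id"]["S"]
            user_id = item["user_id"]["S"]      # assumed attribute
            ts = float(item["timestamp"]["N"])  # assumed attribute
            r.zadd("profile:%s" % user_id, {activity_id: ts})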

Alexander Patrikalakis answered Oct 14 '22