
How to structure a DynamoDB database to allow queries for trending posts?

I am planning on using the following formula to calculate "trending" posts:

Trending Score = (p - 1) / (t + 2)^1.5

p = votes (points) from users. t = time since submission in hours.
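The formula above can be sketched as a small function. This is only an illustration; the name `trending_score` and its parameter names are made up for this example.

```python
def trending_score(points: int, age_hours: float) -> float:
    """Trending Score = (p - 1) / (t + 2)^1.5

    points    -- p, votes from users
    age_hours -- t, time since submission in hours
    """
    return (points - 1) / (age_hours + 2) ** 1.5


# The same post scores lower as it ages, even with unchanged points:
fresh = trending_score(points=10, age_hours=2)    # 9 / 4^1.5  = 9 / 8
older = trending_score(points=10, age_hours=14)   # 9 / 16^1.5 = 9 / 64
print(fresh, older)
```

Because t keeps growing, a post's score changes even when nobody votes on it, so a score stored in the table goes stale unless it is recomputed.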

I am looking for advice on how to structure my database tables so that I can query for trending posts with DynamoDB (a nosql database service from Amazon).

DynamoDB requires a Primary Key for each item in a table. The Primary Key can consist of 2 parts: the Hash Attribute (string or number) and the Range Attribute (string or number). The Hash Attribute is required. If no Range Attribute is used, the Hash Attribute must be unique for each item; if a Range Attribute is used, the combination of Hash Attribute and Range Attribute must be unique, and DynamoDB will build a sorted range index on the Range Attribute.

The structure I had in mind goes as follows:

TableName: Users

HashAttribute:  user_id
RangeAttribute: NONE
OtherFields: first_name, last_name

TableName: Posts

HashAttribute:  post_id
RangeAttribute: NONE
OtherFields: user_id,title, content, points, categories[ ]

TableName: Categories

HashAttribute:  category_name
RangeAttribute: post_id
OtherFields: title, content, points

TableName: Counters

HashAttribute:  counter_name
RangeAttribute: NONE
OtherFields: counter_value
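For reference, here is a rough sketch of the four key schemas written as plain Python dicts in the parameter shape of the current low-level CreateTable API (`TableName`, `KeySchema`, `AttributeDefinitions`). The attribute types ("N" for the ids, "S" for the names), the `BillingMode` value and the helper `table_spec` are assumptions for this example, not something taken from the question. Only key attributes are declared; the other fields need no schema.

```python
def table_spec(name, hash_attr, range_attr=None):
    """Build CreateTable-style parameters for one table.

    hash_attr and range_attr are (attribute_name, attribute_type) pairs;
    the types used below are assumed, not given in the question.
    """
    key_schema = [{"AttributeName": hash_attr[0], "KeyType": "HASH"}]
    attr_defs = [{"AttributeName": hash_attr[0], "AttributeType": hash_attr[1]}]
    if range_attr is not None:
        key_schema.append({"AttributeName": range_attr[0], "KeyType": "RANGE"})
        attr_defs.append({"AttributeName": range_attr[0], "AttributeType": range_attr[1]})
    return {
        "TableName": name,
        "KeySchema": key_schema,
        "AttributeDefinitions": attr_defs,
        "BillingMode": "PAY_PER_REQUEST",  # assumption, chosen to keep the sketch short
    }


TABLES = [
    table_spec("Users", ("user_id", "N")),
    table_spec("Posts", ("post_id", "N")),
    table_spec("Categories", ("category_name", "S"), ("post_id", "N")),
    table_spec("Counters", ("counter_name", "S")),
]
```

With boto3, each dict could be passed along the lines of `client.create_table(**spec)`; that call is not made here so the sketch runs without AWS credentials.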

So here is an example of the types of requests I would make with the table setup above (example: user_id=100):

User Action 1:

User creates a new post and tags the post for 2 categories (baseball,soccer)

Query (1):

Check current value for the counter_name='post_id' and increment+1 and use the new post_id

Query (2): Insert the following into the Posts table:

post_id=value_from_query_1, user_id=100, title=user_generated, content=user_generated, points=0, categories=['baseball','soccer']

Query (3):

Insert the following into the Categories table:

category_name='baseball', post_id=value_from_query_1, title=user_generated, content=user_generated, points=0

Query (4):

Insert the following into the Categories table:

category_name='soccer', post_id=value_from_query_1, title=user_generated, content=user_generated, points=0
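The four queries can be modelled in memory to make the write pattern visible. This is only a sketch with plain dicts standing in for the tables; `create_post` is a hypothetical helper, and no DynamoDB call is made.

```python
counters = {"post_id": 0}   # Counters table: counter_name -> counter_value
posts = {}                  # Posts table: post_id -> item
categories = {}             # Categories table: (category_name, post_id) -> item


def create_post(user_id, title, content, tags):
    # Query (1): increment the shared counter and use the new value as post_id.
    counters["post_id"] += 1
    post_id = counters["post_id"]

    # Query (2): insert the post itself.
    posts[post_id] = {
        "user_id": user_id, "title": title, "content": content,
        "points": 0, "categories": list(tags),
    }

    # Queries (3), (4), ...: one duplicated item per category.
    for tag in tags:
        categories[(tag, post_id)] = {"title": title, "content": content, "points": 0}
    return post_id


post_id = create_post(100, "My title", "My content", ["baseball", "soccer"])
```

Two things follow from this layout: creating a post costs 2 writes plus one per category, and because points are copied into every Categories item, each vote has to update 1 item in Posts plus one per category.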



The end goal is to be able to conduct the following types of queries:

1. Query for trending posts

2. Query for posts in a certain category

3. Query for posts with the highest point values

Does anyone have any idea how I could structure my tables so that I could do a query for trending posts? Or is this something I give up the ability to do by switching to DynamoDB?

Jason Pudzianowski asked Feb 18 '12 05:02



1 Answer

I'll start with a note on your comment about timestamp vs. post_id.
Since you are going to use DynamoDB as your post_id generator, there is a scalability issue right there: every new post has to update the same counter item, so that item becomes a bottleneck. Sequential numbers like these are inherently unscalable, and you are better off using a date object. If you need to create posts at a very high rate, you can start by reading how Twitter does it: http://blog.twitter.com/2010/announcing-snowflake
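To illustrate the idea of time-based ids without a shared counter, here is a minimal sketch in the spirit of Snowflake (timestamp, worker number, sequence number). The bit widths, the epoch and the class `IdGenerator` are assumptions picked for this example, not Twitter's actual implementation.

```python
import threading
import time

EPOCH_MS = 1_325_376_000_000   # 2012-01-01 00:00:00 UTC, an arbitrary assumed epoch
WORKER_BITS = 10               # assumed width: up to 1024 workers
SEQUENCE_BITS = 12             # assumed width: up to 4096 ids per worker per ms


class IdGenerator:
    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self._worker_id = worker_id
        self._clock = clock          # returns the current time in milliseconds
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & ((1 << SEQUENCE_BITS) - 1)
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp = now - EPOCH_MS
            return (
                (timestamp << (WORKER_BITS + SEQUENCE_BITS))
                | (self._worker_id << SEQUENCE_BITS)
                | self._sequence
            )
```

Each application server gets its own `worker_id`, so ids can be generated locally with no write to a Counters table, and they still sort roughly by creation time.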

Now let's get back to your trending check:
I believe your scenario is misusing DynamoDB.
Let's say you have one HOT category that contains most of the posts. Basically you will have to scan all the posts in it (since the data isn't spread well) and, for each one, look at the points and do the comparisons on your server. This will either not work or be very expensive, since each time you will probably use up all your reserved read capacity units.

The DynamoDB approach to this type of trend checking is to use MapReduce.
Read here how to implement it: http://aws.typepad.com/aws/2012/01/aws-howto-using-amazon-elastic-mapreduce-with-dynamodb.html

I can't give exact timings, but I believe you will find this approach scalable, though you won't be able to run it often.
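To show the shape of such a batch job, here is a small in-memory sketch of a map step and a reduce step that compute the top trending posts per category. It is plain Python, not Elastic MapReduce code; the function names, the `created_hours` field and the sample posts are invented for the example.

```python
import heapq
from collections import defaultdict


def map_post(post, now_hours):
    """Map step: emit one (category, (score, post_id)) pair per category."""
    age_hours = now_hours - post["created_hours"]
    score = (post["points"] - 1) / (age_hours + 2) ** 1.5
    for category in post["categories"]:
        yield category, (score, post["post_id"])


def reduce_top_n(pairs, n):
    """Reduce step: keep the n highest-scoring posts for each category."""
    grouped = defaultdict(list)
    for category, scored in pairs:
        grouped[category].append(scored)
    return {category: heapq.nlargest(n, items) for category, items in grouped.items()}


# Made-up sample data: created_hours is the submission time on an hour clock.
sample_posts = [
    {"post_id": 1, "points": 10, "created_hours": 0, "categories": ["baseball"]},
    {"post_id": 2, "points": 5, "created_hours": 10, "categories": ["baseball", "soccer"]},
    {"post_id": 3, "points": 2, "created_hours": 11, "categories": ["soccer"]},
]

pairs = [pair for post in sample_posts for pair in map_post(post, now_hours=12)]
top = reduce_top_n(pairs, n=1)
```

In this sample the newer post 2 outranks the older post 1 in "baseball" despite having fewer points, which is the behaviour the formula is meant to produce. The result of each run could be written back to a small table that the site reads from.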

On another note, you could keep a list of the "top 10/100" trending posts and update it in "real time" whenever a post is upvoted: get the list, check whether it needs to be updated with the newly upvoted post, and save it back to the db if needed.
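The top-list idea can be sketched as a pure function that takes the stored list and the newly upvoted post and returns the list to save back. The name `update_top_list` and the `(score, post_id)` tuple layout are assumptions for this example; reading and writing the list item in the db is left out.

```python
def update_top_list(top_list, post_id, score, limit=10):
    """Return the new top list after post_id received a new score.

    top_list is a list of (score, post_id) tuples, highest score first.
    """
    # Drop any old entry for this post, then add its fresh score.
    entries = [(s, pid) for s, pid in top_list if pid != post_id]
    entries.append((score, post_id))
    entries.sort(reverse=True)
    return entries[:limit]


stored = [(5.0, 1), (3.0, 2)]
after_new_post = update_top_list(stored, post_id=3, score=4.0, limit=2)
after_upvote = update_top_list(stored, post_id=2, score=6.0, limit=2)
```

One caveat with this approach: since the trending score also depends on the post's age, the scores kept in the list go stale between votes, so the list probably needs to be rescored periodically as well, not only on upvotes.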

Chen Harel answered Sep 20 '22 14:09