
Real-time analytic processing system design

I am designing a system that should analyze a large number of user transactions and produce aggregated measures (such as trends). The system should work fast, be robust, and scale well. The system is Java based (on Linux).

The data arrives from a system that generates log files (CSV based) of user transactions. The system generates a file every minute, and each file contains the transactions of different users (sorted by time); each file may contain thousands of users.

A sample data structure for a CSV file:

10:30:01,user 1,...
10:30:01,user 1,...
10:30:02,user 78,...
10:30:02,user 2,...
10:30:03,user 1,...
10:30:04,user 2,...
. . .
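For reference, a line of this format can be parsed as below (a minimal sketch; the class and field names are mine, and everything after the user field is kept as an opaque remainder since the actual columns aren't shown here):

```java
// Minimal sketch of parsing one line of the transaction CSV.
// Only the timestamp and user fields are assumed; the rest is kept as-is.
class TransactionLine {
    final String time;   // e.g. "10:30:01"
    final String user;   // e.g. "user 1"
    final String rest;   // remaining CSV fields, unparsed

    TransactionLine(String time, String user, String rest) {
        this.time = time;
        this.user = user;
        this.rest = rest;
    }

    static TransactionLine parse(String csvLine) {
        // Split into at most 3 parts so commas inside the remainder survive.
        String[] parts = csvLine.split(",", 3);
        return new TransactionLine(parts[0], parts[1], parts.length > 2 ? parts[2] : "");
    }
}
```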

The system I am planning should process the files and perform some analysis in real time. It has to gather the input, send it to several algorithms and other systems, and store computed results in a database. The database does not hold the actual input records, only high-level aggregated analysis of the transactions, for example trends.

The first algorithm I am planning to use requires at least 10 records per user for best operation; if it cannot gather 10 records within 5 minutes, it should use whatever data is available.

I would like to use Storm for the implementation, but I would prefer to keep this discussion at the design level as much as possible.

A list of system components:

  1. A task that monitors incoming files every minute.

  2. A task that reads each file, parses it, and makes it available to other system components and algorithms.

  3. A component that buffers 10 records per user (for no longer than 5 minutes); when 10 records are gathered, or 5 minutes have passed, it is time to send the data to the algorithm for further processing. Since the requirement is to supply at least 10 records to the algorithm, I thought of using Storm field grouping (which means the same task gets called for the same user) and tracking the collection of 10 records per user inside the task. Of course I plan to have several of these tasks, each handling a portion of the users.

  4. There are other components that work on a single transaction, for them I plan on creating other tasks that receive each transaction as it gets parsed (in parallel to other tasks).
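To make #3 concrete, here is the buffering rule I have in mind as an in-memory sketch (the class name and the idea of injecting the clock are mine, just for illustration; whether this map should live in the task or in a distributed cache is exactly the open question):

```java
import java.util.*;

// Sketch of component #3: per-user buffer that flushes at 10 records
// or after 5 minutes, whichever comes first. In-memory only.
class UserBuffer {
    static final int MAX_RECORDS = 10;
    static final long MAX_AGE_MS = 5 * 60 * 1000;

    private final Map<String, List<String>> records = new HashMap<>();
    private final Map<String, Long> firstSeen = new HashMap<>();

    // Returns the flushed batch when the buffer fills up, else null.
    List<String> add(String user, String record, long nowMs) {
        records.computeIfAbsent(user, u -> new ArrayList<>()).add(record);
        firstSeen.putIfAbsent(user, nowMs);
        if (records.get(user).size() >= MAX_RECORDS) {
            return flush(user);
        }
        return null;
    }

    // Called periodically (e.g. every minute): flush buffers older than 5 min.
    List<List<String>> expire(long nowMs) {
        List<List<String>> flushed = new ArrayList<>();
        for (String user : new ArrayList<>(firstSeen.keySet())) {
            if (nowMs - firstSeen.get(user) >= MAX_AGE_MS) {
                flushed.add(flush(user));
            }
        }
        return flushed;
    }

    private List<String> flush(String user) {
        firstSeen.remove(user);
        return records.remove(user);
    }
}
```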

I need your help with #3.

What are the best practices for designing such a component? It obviously needs to maintain up to 10 records per user. A key-value map may help. Is it better to manage the map in the task itself or to use a distributed cache, for example Redis, a key-value store (which I have never used before)?

Thanks for your help

user1550706 asked Jul 25 '12

1 Answer

I have worked with Redis quite a bit, so I'll comment on your thought of using it.

#3 has three requirements:

  1. A buffer per user

  2. A buffer capped at 10 records

  3. The buffer should expire after 5 minutes

1. Buffer per user: Redis is just a key-value store. Although it supports a wide variety of data types, they are always values mapped to a string key. So you must decide how to identify a user uniquely if you need a per-user buffer, because Redis never raises an error when you overwrite a key with a new value. One solution is to check for the key's existence before writing.

2. Buffer capped at 10 records: You can obviously implement a queue in Redis, but restricting its size is left to you, e.g. using LPUSH and LTRIM, or using LLEN to check the length and decide whether to trigger your process. The key associated with this queue should be the one you chose in part 1.

3. Buffer expires after 5 minutes: This is the toughest part. In Redis, every key, irrespective of the data type of its value, can have an expiry. But the expiry process is silent: you won't be notified when a key expires, so you will silently lose your buffer if you rely on this property. One workaround is to maintain an index that maps a timestamp to the keys that need to expire at that time. A background job can then read the index every minute, manually delete each key from Redis (after reading its data), and call your desired process with the buffered data. For such an index, look at sorted sets: the timestamp is the score, and the set members are the keys (the unique per-user keys from part 1, each mapping to a queue) you wish to delete at that timestamp. You can do a ZRANGEBYSCORE to read all set members up to a specified timestamp.
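The index idea can be modeled with a sorted map standing in for the sorted set (the class name is mine; with a real client the operations marked in the comments would be the corresponding Redis commands):

```java
import java.util.*;

// In-memory model of the sorted-set index: score = expiry timestamp,
// member = the per-user list key. With a real Redis client this would be
// ZADD on write and ZRANGEBYSCORE + DEL on the minutely sweep.
class ExpiryIndex {
    private final TreeMap<Long, Set<String>> index = new TreeMap<>();

    // ZADD index <expiresAtMs> <key>
    void add(String key, long expiresAtMs) {
        index.computeIfAbsent(expiresAtMs, t -> new LinkedHashSet<>()).add(key);
    }

    // ZREM index <key> -- called when a buffer fills up before expiring.
    void remove(String key) {
        for (Set<String> keys : index.values()) keys.remove(key);
    }

    // ZRANGEBYSCORE index -inf <nowMs>: keys whose buffers must be flushed.
    List<String> expiredKeys(long nowMs) {
        List<String> expired = new ArrayList<>();
        for (Set<String> keys : index.headMap(nowMs, true).values()) {
            expired.addAll(keys);
        }
        index.headMap(nowMs, true).clear();
        return expired;
    }
}
```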

Overall:

Use a Redis list to implement the queue.

Use LLEN to make sure you do not exceed the limit of 10.

Whenever you create a new list, make an entry in the index (sorted set) with the current timestamp + 5 minutes as the score and the list's key as the member.

When LLEN reaches 10, read the data, remove the key from the index (sorted set) and from the db (delete the key -> list), then trigger your process with the data.

Every minute, generate the current timestamp, read the index, and for every key returned: read its data, remove the key from the db, and trigger your process.
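Putting those steps together, a minimal in-memory model of the whole scheme could look like this (since you want to stay at the design level, plain Java collections stand in for the Redis calls named in the comments; the class and key naming are mine):

```java
import java.util.*;

// End-to-end model of the scheme: per-user list plus expiry index.
// Each Redis call (LPUSH, LLEN, ZADD, ZREM, ZRANGEBYSCORE, DEL) is modeled
// with plain Java collections; swap in a real client for production.
class RedisBufferModel {
    static final int LIMIT = 10;
    static final long TTL_MS = 5 * 60 * 1000;

    private final Map<String, List<String>> lists = new HashMap<>();   // user key -> list
    private final TreeMap<Long, Set<String>> index = new TreeMap<>();  // expiry -> keys

    // Returns a full batch of 10 records, or null if still buffering.
    List<String> push(String userKey, String record, long nowMs) {
        List<String> list = lists.get(userKey);
        if (list == null) {                        // new list: register its expiry (ZADD)
            list = new ArrayList<>();
            lists.put(userKey, list);
            index.computeIfAbsent(nowMs + TTL_MS, t -> new HashSet<>()).add(userKey);
        }
        list.add(record);                          // LPUSH
        if (list.size() < LIMIT) return null;      // LLEN < 10: keep buffering
        for (Set<String> keys : index.values()) keys.remove(userKey); // ZREM
        return lists.remove(userKey);              // DEL, hand the batch to the process
    }

    // Minutely sweep: flush every buffer whose expiry has passed.
    Map<String, List<String>> sweep(long nowMs) {
        Map<String, List<String>> flushed = new LinkedHashMap<>();
        for (Set<String> keys : index.headMap(nowMs, true).values()) {
            for (String key : keys) flushed.put(key, lists.remove(key));
        }
        index.headMap(nowMs, true).clear();
        return flushed;
    }
}
```

Note that the 10-record flush removes the key from the index as well, so a buffer is never processed twice.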

This is how I might implement it. There may be better ways to model your data in Redis.

Tamil answered Nov 01 '22