realtime querying/aggregating millions of records - hadoop? hbase? cassandra?

I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should come back instantaneously. So, any help would be appreciated. Thanks!

Here's what I have:

  • 1000s of datasets
  • dataset keys:
    • all datasets have the same keys
    • 1 million keys (this may later be 10 or 20 million)
  • dataset columns:
    • each dataset has the same columns
    • 10 to 20 columns
    • most columns are numerical values that we need to aggregate (avg, stddev, and use R to calculate statistics)
    • a few columns are "type_id" columns, since in a particular query we may want to only include certain type_ids
  • web application
    • user can choose which datasets they are interested in (anywhere from 15 to 1000)
    • application needs to present: key, and aggregated results (avg, stddev) of each column
  • updates of data:
    • an entire dataset can be added, dropped, or replaced/updated
    • would be cool to be able to add columns. But, if required, can just replace the entire dataset.
    • never add rows/keys to a dataset - so don't need a system with lots of fast writes
  • infrastructure:
    • currently two machines with 24 cores each
    • eventually, want ability to also run this on amazon

I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.

  • partitions are nice, since I can easily add/drop/replace partitions
  • the database is nice for filtering based on type_id (see the query sketch after this list)
  • databases aren't easy for writing parallel queries
  • databases are good for structured data, and my data is not structured
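
For reference, the aggregation I run against postgres today looks roughly like the sketch below. The table and column names (measurements, dataset_id, val_1) and the connection string are made up for illustration; the real schema has 10 to 20 numeric columns.

    # Sketch of the current per-partition aggregation in postgres.
    # Names (measurements, dataset_id, val_1, dbname) are illustrative only.
    import psycopg2

    def aggregate(selected_datasets, type_ids):
        conn = psycopg2.connect("dbname=stats")  # hypothetical connection string
        cur = conn.cursor()
        cur.execute(
            """
            SELECT key,
                   avg(val_1)         AS avg_val_1,
                   stddev_samp(val_1) AS stddev_val_1
            FROM   measurements
            WHERE  dataset_id = ANY(%s)
              AND  type_id    = ANY(%s)
            GROUP BY key
            """,
            (selected_datasets, type_ids),
        )
        return cur.fetchall()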

As a proof of concept I tried out hadoop (a rough sketch of the mapper/reducer is shown after this list):

  • created a tab separated file per dataset for a particular type_id
  • uploaded to hdfs
  • map: retrieved a value/column for each key
  • reduce: computed average and standard deviation
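
The streaming scripts were along these lines. This is a simplified sketch of the PoC, not the exact code; mapper.py, reducer.py, and VALUE_COL are illustrative names, and the input is assumed to be the tab-separated files described above.

    #!/usr/bin/env python
    # mapper.py -- sketch: emit key<TAB>value for one numeric column.
    # VALUE_COL is a stand-in; the real files have 10-20 columns.
    import sys

    VALUE_COL = 1
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        print("%s\t%s" % (fields[0], fields[VALUE_COL]))

    #!/usr/bin/env python
    # reducer.py -- sketch: streaming input arrives sorted by key, so keep
    # count / sum / sum-of-squares per key and emit avg + stddev.
    import sys, math

    def flush(key, n, s, s2):
        if key is None or n == 0:
            return
        mean = s / n
        var = s2 / n - mean * mean
        print("%s\t%f\t%f" % (key, mean, math.sqrt(max(var, 0.0))))

    cur, n, s, s2 = None, 0, 0.0, 0.0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != cur:
            flush(cur, n, s, s2)
            cur, n, s, s2 = key, 0, 0.0, 0.0
        v = float(value)
        n += 1
        s += v
        s2 += v * v
    flush(cur, n, s, s2)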

From my crude proof of concept, I can see this will scale nicely, but I can also see that hadoop/hdfs has latency, and I've read that it's generally not used for real-time querying (even though I'm OK with returning results to users within 5 seconds).

Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?

thanks!

asked Jul 26 '11 by anish
1 Answer

Neither Hive nor Pig seems like it would help you. Essentially, each of them compiles down to one or more map/reduce jobs, so the response cannot come back within 5 seconds.

HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavyweight reduces.

Check out http://en.wikipedia.org/wiki/Standard_deviation:

stddev(X) = sqrt(E[X^2] - (E[X])^2)

This implies that you can get the stddev of the combined data AB by computing sqrt(E[AB^2] - (E[AB])^2), where E[AB^2] = (sum(A^2) + sum(B^2)) / (|A| + |B|) and, likewise, E[AB] = (sum(A) + sum(B)) / (|A| + |B|).
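
Concretely: if you store, per dataset and per column, just the count, sum(x), and sum(x^2), then the mean and stddev over any selection of datasets is a cheap merge of those three numbers, with no heavyweight reduce. A sketch (illustrative code, not tested against your data):

    import math

    # Each dataset contributes a per-column triple (n, sum_x, sum_x2),
    # precomputed once when the dataset is loaded/replaced.
    def combine(triples):
        n   = sum(t[0] for t in triples)
        sx  = sum(t[1] for t in triples)
        sx2 = sum(t[2] for t in triples)
        mean = sx / n                         # E[X]
        var  = sx2 / n - mean * mean          # E[X^2] - (E[X])^2
        return mean, math.sqrt(max(var, 0.0))

    # e.g. combine([(n_a, sum_a, sumsq_a), (n_b, sum_b, sumsq_b)])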

answered by David