Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Free data warehouse - Infobright, Hadoop/Hive or what?

Tags:

I need to store large amount of small data objects (millions of rows per month). Once they're saved they wont change. I need to :

  • store them securely
  • use them to analysis (mostly time-oriented)
  • retrieve some raw data occasionally
  • It would be nice if it could be used with JasperReports or BIRT

My first shot was Infobright Community - just a column-oriented, read-only storing mechanism for MySQL

On the other hand, people says that NoSQL approach could be better. Hadoop+Hive looks promissing, but the documentation looks poor and the version number is less than 1.0 .

I heard about Hypertable, Pentaho, MongoDB ....

Do you have any recommendations ?

(Yes, I found some topics here, but it was year or two ago)

Edit: Other solutions : MonetDB, InfiniDB, LucidDB - what do you think?

like image 953
Piotr Gwiazda Avatar asked Mar 11 '10 07:03

Piotr Gwiazda


2 Answers

Am having the same problem here and made researches; two types of storages for BI :

  • column oriented. Free and known : monetDB, LucidDb, Infobright. InfiniDB
  • Distributed : hTable, Cassandra (also column oriented theoretically)
  • Document oriented / MongoDb, CouchDB

The answer depends on what you really need :

  • If your millions of row are loaded at once (nighly batch or so), InfiniDB or other column oriented DB are the best; They have great performance and are "BI oriented". http://www.d1solutions.ch/papers/d1_2010_hauenstein_real_life_performance_database.pdf And they won't require a setup of "nodes", "sharding" and other stuff that comes with distributed/"NoSQL" DBs.

http://www.mysqlperformanceblog.com/2010/01/07/star-schema-bechmark-infobright-infinidb-and-luciddb/

  • If the rows are added in real time.. then column oriented DB are bad. You can either choose two have two separate DB (that's my choice : one noSQL for real feeding of the stats by the front, and real time stats. The other DB column-oriented for BI). Or turn towards something that mixes column oriented (for out requests) and distribution (for writes) / like Cassandra.

Document oriented DBs are not suited for BI, they are more useful for CRM/CMS issues where you need frequent access to a particular row

As for the exact choice inside a category, I'm still undecided. Cassandra in distributed, and Monet or InfiniDB for CODB, are leaders. Monet is reported to have problem loading very big tables because it runs indexes in memory.

like image 102
spiritoo Avatar answered Sep 28 '22 18:09

spiritoo


You could also consider GridSQL. Even for a single server, you can create multiple logical "nodes" to utilize multiple cores when processing queries.

GridSQL uses PostgreSQL, so you can also take advantage of partitioning tables into subtables to evaluate queries faster. You mentioned the data is time-oriented, so that would be a good candidate for creating subtables.

like image 32
Mason Avatar answered Sep 28 '22 18:09

Mason