
Large Data Sets - NoSQL, NewSQL, SQL..? Brain Fried

I'm in need of some advice. I'm working on a new start-up in the data mining field. It is basically a spin-off of a research project.

Anyway, we have a large amount of unstructured data, and we are running various NLP, classification and clustering analyses on it.

We have millions of messages: Twitter messages, blog posts, forum posts, newspaper articles, reports and so on. All text. All up we are talking about 300GB+ of text data, growing by about 10GB per day!

So we need somewhere to store all of this information in a format that we can actually process and query, and get relatively real-time results.

As this is a new start-up we really can't (and don't want to) pay for a licensed product, so enterprise editions of VoltDB, Oracle, etc. are out of reach.

I was thinking this may be the perfect application for a non-relational "NoSQL" database such as Apache Cassandra or Hadoop/HBase (column family) or MongoDB (document), with VoltDB (community edition) or MySQL as the other options.

Currently all the data is in TSV text files and is processed as it's written to file. Needless to say it's painful: the whole thing is stuck in one process and we can't query it. It works, but it's way too limited for the richness of what we could be doing with this data set.

Anyway, I was hoping someone could share their experience with any of the above tools, or any recommendations for this use case (a large set of unstructured text data) covering natural language processing, classification, clustering, frequency gathering, real-time analysis, etc.?

My biggest fear is that MySQL won't be able to handle the sheer volume of data going forward. This thing will be in the terabyte range by the end of the year, so we are in part trying to get ahead of the curve by implementing a scalable solution that will allow us to easily query the data...
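As a rough back-of-envelope check on those numbers, here is a small sketch. It assumes growth stays linear at about 10GB per day and treats 1TB as 1000GB; both are simplifications, and the function names are just made up for illustration.

```python
import math


def days_until(target_gb: float,
               current_gb: float = 300.0,
               growth_gb_per_day: float = 10.0) -> int:
    """Days of linear growth needed before the data set reaches target_gb."""
    remaining = max(target_gb - current_gb, 0.0)
    return math.ceil(remaining / growth_gb_per_day)


def size_after(days: int,
               current_gb: float = 300.0,
               growth_gb_per_day: float = 10.0) -> float:
    """Projected size in GB after the given number of days."""
    return current_gb + growth_gb_per_day * days


if __name__ == "__main__":
    # Assumption: 1TB is taken as 1000GB for simplicity.
    print("days until 1TB:", days_until(1000.0))
    print("GB after one year:", size_after(365))
```

On these assumptions the terabyte mark is only a couple of months away, which is why we want to sort the storage out now.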

I'm thinking a non-relational/NoSQL column family database like HBase is best. Since we are adding new data sources all the time (crawlers, streaming APIs, etc.), it will be much easier if we have an unstructured model.

Any help would be greatly appreciated! Hell there might even be a job in it :)

Cheers!

NightWolf asked May 09 '11 01:05



1 Answer

You need to think carefully about what types of queries you will need to run over these docs. Cassandra and similar stores may well be a good fit if your queries are basic, but richer SQL-like queries are not possible. The largest Cassandra deployments are of the order of 150TB, so your data volumes should not be a problem; but Cassandra's performance may be overkill, and you would sacrifice query richness for it.
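To make the trade-off concrete, here is a toy sketch of the same question asked two ways. The plain dict only stands in for a key-based store and the in-memory `sqlite3` database only stands in for a SQL engine; neither is Cassandra's or MySQL's actual API, and the sample messages and function names are invented for illustration.

```python
import sqlite3

# Hypothetical sample rows: (id, source, day, body)
MESSAGES = [
    ("m1", "twitter", "2011-05-01", "hbase looks promising"),
    ("m2", "blog", "2011-05-01", "mysql still works fine"),
    ("m3", "twitter", "2011-05-02", "trying lucene for search"),
]

# Key-based store: very fast when you already know the key...
BY_KEY = {mid: (source, day, body) for mid, source, day, body in MESSAGES}


def count_by_source_scan(store):
    """...but an ad hoc question means scanning every value yourself."""
    counts = {}
    for source, _day, _body in store.values():
        counts[source] = counts.get(source, 0) + 1
    return counts


def count_by_source_sql(rows):
    """The same question as one declarative query in a SQL engine."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages (id TEXT PRIMARY KEY, source TEXT, day TEXT, body TEXT)"
    )
    conn.executemany("INSERT INTO messages VALUES (?, ?, ?, ?)", rows)
    result = dict(conn.execute("SELECT source, COUNT(*) FROM messages GROUP BY source"))
    conn.close()
    return result


if __name__ == "__main__":
    print(BY_KEY["m2"])                  # direct lookup by key
    print(count_by_source_scan(BY_KEY))  # hand-written scan
    print(count_by_source_sql(MESSAGES)) # SQL aggregate
```

If most of your access looks like the first lookup, a key-based store fits well; if you expect lots of queries like the last one, you will either precompute them or want something with a richer query layer.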

If you just want text indexing, then also consider Lucene. I think Lucene can now handle over 100 GB/hour for batch indexing, so overnight indexing of 1TB would be possible, and Lucene now claims comparable speeds for incremental indexing too...
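In case "text indexing" is unfamiliar, here is a toy inverted index in plain Python. It is not Lucene's API and it ignores everything that makes Lucene fast (on-disk segments, scoring, analyzers); the class name and tokenizer are made up, and it only shows the basic idea of mapping each term to the documents that contain it, with documents added incrementally.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


class TinyIndex:
    """Toy inverted index: term -> set of ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, text):
        """Index one document; can be called at any time (incremental)."""
        self.docs[doc_id] = text
        for term in set(TOKEN.findall(text.lower())):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every term in the query."""
        terms = TOKEN.findall(query.lower())
        if not terms:
            return set()
        result = None
        for term in terms:
            ids = self.postings.get(term, set())
            result = set(ids) if result is None else result & ids
        return result


if __name__ == "__main__":
    index = TinyIndex()
    index.add("t1", "Trying HBase for our text data")
    index.add("b7", "Lucene handles text search well")
    index.add("f3", "HBase and Lucene together")
    print(sorted(index.search("text")))
    print(sorted(index.search("hbase lucene")))
```

A lookup only touches the postings for the query terms rather than every document, which is why this structure suits keyword search over a large, growing corpus.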

DNA answered Sep 28 '22 07:09
