 

Database choices for big data [closed]

I have many text files whose total size is about 300 GB to 400 GB. They are all in this format:

key1 value_a
key1 value_b
key1 value_c
key2 value_d
key3 value_e
....

Each line consists of a key and a value. I want to create a database that lets me query all values of a key. For example, when I query key1, value_a, value_b, and value_c are returned.

First of all, inserting all these files into the database is a big problem. I tried inserting a few-GB chunk into a MySQL MyISAM table with the LOAD DATA INFILE syntax, but it appears MySQL can't utilize multiple cores for inserting data, and it is painfully slow. So I think MySQL is not a good choice here for so many records.

Also, I need to update or recreate the database periodically (weekly, or even daily if possible), so insertion speed is important to me.

It's not possible for a single node to do the computing and insertion efficiently; I think it's better to perform the insertion on different nodes in parallel.

For example,

node1 -> compute and store 0-99999.txt
node2 -> compute and store 100000-199999.txt
node3 -> compute and store 200000-299999.txt
....

So, here comes the first criterion.

Criterion 1. Fast insertion speed in a distributed, batch manner.

Then, as you can see in the text file example, the same key can appear with multiple different values, just as key1 maps to value_a/value_b/value_c in the example.

Criterion 2. Duplicate keys (multiple values per key) are allowed.

Then, I will need to query keys in the database. No relational or complex join queries are required; all I need is simple key/value querying. The important part is that one key can map to multiple values.

Criterion 3. Simple and fast key/value querying.

I know there are HBase, Cassandra, MongoDB, Redis, and so on, but I'm not familiar with all of them, and I'm not sure which one fits my needs. So the question is: what database should I use? If none of them fits my needs, I even plan to build my own, but that takes effort :/

Thanks.

asked Apr 05 '12 by Fang-Pen Lin


1 Answer

There are probably a lot of systems that would fit your needs. Your requirements make things pleasantly easy in a couple of ways:

  • Because you don't need any cross-key operations, you could use multiple databases, dividing keys between them via hash or range sharding (see the hashing sketch after this list). This is an easy way to solve the lack of parallelism that you observed with MySQL and probably would observe with a lot of other database systems.
  • Because you never do any online updates, you can just build an immutable database in bulk and then query it for the rest of the day/week. I'd expect you'd get a lot better performance this way.
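To make the sharding concrete, here is a minimal sketch, with the caveat that kNumShards, Fnv1a, and ShardFor are hypothetical names and 16 is a placeholder shard count to tune against your cores and disks. Any stable hash works, as long as the build jobs and the query front end use the same one.

    #include <cstdint>
    #include <string>

    // Placeholder shard count; tune to your core/disk counts.
    constexpr uint32_t kNumShards = 16;

    // FNV-1a: a tiny, stable hash, so builders and queriers always
    // agree on which shard owns a key.
    uint32_t Fnv1a(const std::string& s) {
      uint32_t h = 2166136261u;
      for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;
      }
      return h;
    }

    uint32_t ShardFor(const std::string& key) {
      return Fnv1a(key) % kNumShards;
    }

Each node then computes and stores only its own shards, and a query for key1 touches exactly one of them.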

I'd be inclined to build a set of hash-sharded LevelDB tables. That is, I wouldn't use an actual leveldb::DB, which supports a more complex data structure (a stack of tables and a log) so that you can do online updates; instead, I'd directly use leveldb::Table and leveldb::TableBuilder objects (no log, only one table for a given key). This is a very efficient format for querying, and if your input files are already sorted as in your example, the table building will be extremely efficient as well.

You can achieve whatever parallelism you desire by increasing the number of shards. If you're using a 16-core, 16-disk machine to build the database, use at least 16 shards, all generated in parallel. If you're using 16 16-core, 16-disk machines, use at least 256 shards. If you have a lot fewer disks than cores, as many people do these days, try both, but you may find fewer shards are better to avoid seeks.

If you're careful, I think you can basically max out the disk throughput while building tables, and that's saying a lot, as I'd expect the tables to be noticeably smaller than your input files due to the key prefix compression (and optionally Snappy block compression). You'll mostly avoid seeks because, aside from a relatively small index that you can typically buffer in RAM, the keys in the leveldb tables are stored in the same order as you read them from the input files, assuming again that your input files are already sorted. If they're not, you may want enough shards that you can sort a shard in RAM and then write it out, perhaps processing shards more sequentially. A minimal sketch of the build and query sides follows below.
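A few stated assumptions in this sketch: BuildShard and QueryShard are hypothetical helper names; the input pairs are already sorted by key; and, because leveldb::TableBuilder requires strictly increasing keys, each stored key gets a fixed-width sequence suffix (separated by a '\0' byte) so duplicate user keys stay distinct and contiguous. Error handling is abbreviated.

    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <utility>
    #include <vector>

    #include "leveldb/env.h"
    #include "leveldb/iterator.h"
    #include "leveldb/options.h"
    #include "leveldb/table.h"
    #include "leveldb/table_builder.h"

    // Build one shard from (key, value) pairs already sorted by key.
    // TableBuilder requires strictly increasing keys, so a fixed-width
    // sequence suffix keeps duplicate user keys distinct.
    leveldb::Status BuildShard(
        const std::string& path,
        const std::vector<std::pair<std::string, std::string>>& rows) {
      leveldb::WritableFile* file = nullptr;
      leveldb::Status s = leveldb::Env::Default()->NewWritableFile(path, &file);
      if (!s.ok()) return s;

      leveldb::Options options;  // defaults: prefix compression + Snappy blocks
      leveldb::TableBuilder builder(options, file);
      std::string prev;
      uint64_t seq = 0;
      for (const auto& row : rows) {
        seq = (row.first == prev) ? seq + 1 : 0;
        char suffix[16];
        std::snprintf(suffix, sizeof(suffix), "%012llu",
                      static_cast<unsigned long long>(seq));
        builder.Add(row.first + '\0' + suffix, row.second);
        prev = row.first;
      }
      s = builder.Finish();
      if (s.ok()) s = file->Sync();
      if (s.ok()) s = file->Close();
      delete file;
      return s;
    }

    // Return every value stored under one user key in one shard.
    std::vector<std::string> QueryShard(const std::string& path,
                                        const std::string& key) {
      std::vector<std::string> values;
      leveldb::Env* env = leveldb::Env::Default();
      uint64_t size = 0;
      leveldb::RandomAccessFile* file = nullptr;
      if (!env->GetFileSize(path, &size).ok() ||
          !env->NewRandomAccessFile(path, &file).ok()) {
        return values;
      }
      leveldb::Options options;
      leveldb::Table* table = nullptr;
      if (leveldb::Table::Open(options, file, size, &table).ok()) {
        const std::string prefix = key + '\0';  // entries for `key` are contiguous
        leveldb::Iterator* it = table->NewIterator(leveldb::ReadOptions());
        for (it->Seek(prefix); it->Valid() && it->key().starts_with(prefix);
             it->Next()) {
          values.push_back(it->value().ToString());
        }
        delete it;
        delete table;
      }
      delete file;
      return values;
    }

A front end would hash the incoming key with ShardFor from the earlier sketch, then call QueryShard on the matching shard's file; no other shard needs to be touched.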

answered by Scott Lamb