I have many text files, their total size is about 300GB ~ 400GB. They are all in this format
key1 value_a
key1 value_b
key1 value_c
key2 value_d
key3 value_e
....
each line is composed by a key and a value. I want to create a database which can let me query all value of a key. For example, when I query key1, value_a, value_b and value_c are returned.
First of all, inserting all these files into the database is a big problem. I try to insert a few GBs size chunk to MySQL MyISAM table with LOAD DATA INFILE syntax. But it appears MySQL can't utilize multicores for inserting data. It's as slow as hell. So, I think MySQL is not a good choice here for so many records.
Also, I need to update or recreate the database periodically, weekly, or even daily if possible, therefore, insertion speed is important for me.
It's not possible for a single node to do the computing and insertion efficiently, to be efficient, I think it's better to perform the insertion in different nodes parallely.
For example,
node1 -> compute and store 0-99999.txt
node2 -> compute and store 10000-199999.txt
node3 -> compute and store 20000-299999.txt
....
So, here comes the first criteria.
Criteria 1. Fast insertion speed in distributed batch manner.
Then, as you can see in the text file example, it's better to provide multiple same key to different values. Just like key1 maps to value_a/value_b/value_c in the example.
Criteria 2. Multiple keys are allowed
Then, I will need to query keys in the database. No relational or complex join query is required, all I need is simple key/value querying. The important part is that multiple key to same value
Criteria 3. Simple and fast key value querying.
I know there are HBase/Cassandra/MongoDB/Redis.... and so on, but I'm not familiar with all of them, not sure which one fits my needs. So, the question is - what database to use? If none of them fits my needs, I even plan to build my own, but it takes efforts :/
Thanks.
There are probably a lot of systems that would fit your needs. Your requirements make things pleasantly easy in a couple ways:
I'd be inclined to build a set of hash-sharded LevelDB tables. That is, I wouldn't use an actual leveldb::DB
which supports a more complex data structure (a stack of tables and a log) so that you can do online updates; instead, I'd directly use leveldb::Table
and leveldb::TableBuilder
objects (no log, only one table for a given key). This is a very efficient format for querying. And if your input files are already sorted like in your example, the table building will be extremely efficient as well. You can achieve whatever parallelism you desire by increasing the number of shards - if you're using a 16-core, 16-disk machine to build the database, then use at least 16 shards, all generated in parallel. If you're using 16 16-core, 16-disk machines, at least 256 shards. If you have a lot fewer disks than cores as many people do these days, try both, but you may find fewer shards are better to avoid seeks. If you're careful, I think you can basically max out the disk throughput while building tables, and that's saying a lot as I'd expect the tables to be noticeably smaller than your input files due to the key prefix compression (and optionally Snappy block compression). You'll mostly avoid seeks because aside from a relatively small index that you can typically buffer in RAM, the keys in the leveldb tables are stored in the same order as you read them from the input files, assuming again that your input files are already sorted. If they're not, you may want enough shards that you can sort a shard in RAM then write it out, perhaps processing shards more sequentially.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With