I am dealing with large amounts of scientific data that are stored in tab separated .tsv
files. The typical operations to be performed are reading several large files, filtering out only certain columns/rows, joining with other sources of data, adding calculated values and writing the result as another .tsv.
The plain text is used for its robustness, longevity and self-documenting character. Storing the data in another format is not an option, it has to stay open and easy to process. There is a lot of data (tens of TBs), and it is not affordable to load a copy into a relational database (we would have to buy twice as much storage space).
Since I am mostly doing selects and joins, I realized I basically need a database engine with .tsv based backing store. I do not care about transactions, since my data is all write-once-read-many. I need to process the data in-place, without a major conversion step and data cloning.
As there is a lot of data to be queried this way, I need to process it efficiently, utilizing caching and a grid of computers.
Does anyone know of a system that would provide database-like capabilities, while using plain tab-separated files as backend? It seems to me like a very generic problem, that virtually all scientists get to deal with in one way or the other.
In the Netflix data center, we primarily use Oracle to persist data. In parts of the movie recommendation infrastructure, we use MySQL. Both are relational databases. In our data center, we do not currently use key-value stores for persistent storage.
Amazon DynamoDB It's a fully managed, multiregion, multimaster database with built-in security, backup and restore, and in-memory caching for internet-scale applications. DynamoDB can handle more than 10 trillion requests per day and support peaks of more than 20 million requests per second.
MongoDB is suitable for hierarchical data storage and is almost 100 times faster than Relational Database Management System (RDBMS).
Instagram mainly uses two backend database systems: PostgreSQL and Cassandra. Both PostgreSQL and Cassandra have mature replication frameworks that work well as a globally consistent data store. Global data neatly maps to data stored in these servers.
There is a lot of data (tens of TBs), and it is not affordable to load a copy into a relational database (we would have to buy twice as much storage space).
You know your requirements better than any of us, but I would suggest you think again about this. If you have 16-bit integers (0-65535) stored in a csv file, your .tsv storage efficiency is about 33%: it takes 5 bytes to store most 16-bit integers plus a delimiter = 6 bytes, whereas the native integers take 2 bytes. For floating-point data the efficiency is even worse.
I would consider taking the existing data, and instead of storing raw, processing it in the following two ways:
This would create an auditable (but perhaps slowly accessible) archive with low risk of data loss, and a quickly-accessible database that doesn't need to be concerned with losing the source data, since you can always re-read it into the database from the archive.
You should be able to reduce your storage space and should not need twice as much storage space, as you state.
Indexing is going to be the hard part; you'd better have a good idea of what subset of the data you need to be able to query efficiently.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With