Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scalable, fast, text file backed database engine?

I am dealing with large amounts of scientific data that are stored in tab separated .tsv files. The typical operations to be performed are reading several large files, filtering out only certain columns/rows, joining with other sources of data, adding calculated values and writing the result as another .tsv.

The plain text is used for its robustness, longevity and self-documenting character. Storing the data in another format is not an option, it has to stay open and easy to process. There is a lot of data (tens of TBs), and it is not affordable to load a copy into a relational database (we would have to buy twice as much storage space).

Since I am mostly doing selects and joins, I realized I basically need a database engine with .tsv based backing store. I do not care about transactions, since my data is all write-once-read-many. I need to process the data in-place, without a major conversion step and data cloning.

As there is a lot of data to be queried this way, I need to process it efficiently, utilizing caching and a grid of computers.

Does anyone know of a system that would provide database-like capabilities, while using plain tab-separated files as backend? It seems to me like a very generic problem, that virtually all scientists get to deal with in one way or the other.

like image 594
Roman Zenka Avatar asked Jul 29 '10 20:07

Roman Zenka


People also ask

What database does Netflix use?

In the Netflix data center, we primarily use Oracle to persist data. In parts of the movie recommendation infrastructure, we use MySQL. Both are relational databases. In our data center, we do not currently use key-value stores for persistent storage.

What DBMS is used by Amazon?

Amazon DynamoDB It's a fully managed, multiregion, multimaster database with built-in security, backup and restore, and in-memory caching for internet-scale applications. DynamoDB can handle more than 10 trillion requests per day and support peaks of more than 20 million requests per second.

Which database is very fast?

MongoDB is suitable for hierarchical data storage and is almost 100 times faster than Relational Database Management System (RDBMS).

What database does Instagram use?

Instagram mainly uses two backend database systems: PostgreSQL and Cassandra. Both PostgreSQL and Cassandra have mature replication frameworks that work well as a globally consistent data store. Global data neatly maps to data stored in these servers.


1 Answers

There is a lot of data (tens of TBs), and it is not affordable to load a copy into a relational database (we would have to buy twice as much storage space).

You know your requirements better than any of us, but I would suggest you think again about this. If you have 16-bit integers (0-65535) stored in a csv file, your .tsv storage efficiency is about 33%: it takes 5 bytes to store most 16-bit integers plus a delimiter = 6 bytes, whereas the native integers take 2 bytes. For floating-point data the efficiency is even worse.

I would consider taking the existing data, and instead of storing raw, processing it in the following two ways:

  1. Store it compressed in a well-known compression format (e.g. gzip or bzip2) onto your permanent archiving media (backup servers, tape drives, whatever), so that you retain the advantages of the .tsv format.
  2. Process it into a database which has good storage efficiency. If the files have a fixed and rigorous format (e.g. column X is always a string, column Y is always a 16-bit integer), then you're probably in good shape. Otherwise, a NoSQL database might be better (see Stefan's answer).

This would create an auditable (but perhaps slowly accessible) archive with low risk of data loss, and a quickly-accessible database that doesn't need to be concerned with losing the source data, since you can always re-read it into the database from the archive.

You should be able to reduce your storage space and should not need twice as much storage space, as you state.

Indexing is going to be the hard part; you'd better have a good idea of what subset of the data you need to be able to query efficiently.

like image 78
Jason S Avatar answered Sep 23 '22 18:09

Jason S