Scalable, fast, text file backed database engine?

Tags: database, csv, large-data, scientific-computing, plaintext

I am dealing with large amounts of scientific data stored in tab-separated .tsv files. The typical operations are reading several large files, filtering out only certain columns/rows, joining with other sources of data, adding calculated values and writing the result as another .tsv.

Plain text is used for its robustness, longevity and self-documenting character. Storing the data in another format is not an option; it has to stay open and easy to process. There is a lot of data (tens of TBs), and it is not affordable to load a copy into a relational database (we would have to buy twice as much storage space).

Since I am mostly doing selects and joins, I realized I basically need a database engine with a .tsv-based backing store. I do not care about transactions, since my data is all write-once-read-many. I need to process the data in place, without a major conversion step or data cloning.

As there is a lot of data to be queried this way, I need to process it efficiently, utilizing caching and a grid of computers.

Does anyone know of a system that would provide database-like capabilities while using plain tab-separated files as a backend? It seems to me like a very generic problem that virtually all scientists have to deal with in one way or another.

asked Jul 29 '10 20:07

Roman Zenka

1 Answer

> There is a lot of data (tens of TBs), and it is not affordable to load a copy into a relational database (we would have to buy twice as much storage space).

You know your requirements better than any of us, but I would suggest you think again about this. If you have 16-bit integers (0-65535) stored in a .tsv file, your storage efficiency is about 33%: most 16-bit integers take 5 digits (5 bytes) plus a 1-byte delimiter, 6 bytes in total, whereas a native 16-bit integer takes 2 bytes. For floating-point data the efficiency is even worse.
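As a rough check on that figure, here is a minimal sketch that compares the two encodings over every possible unsigned 16-bit value. The helper names `text_bytes` and `binary_bytes` are made up for illustration, and the text size assumes one single-byte delimiter (tab or newline) per value.

```python
import struct


def text_bytes(values):
    """Bytes needed to store the values as decimal text, one delimiter each."""
    return sum(len(str(v)) + 1 for v in values)


def binary_bytes(values):
    """Bytes needed to store the values as native unsigned 16-bit integers."""
    return len(struct.pack(f"<{len(values)}H", *values))


# Every value an unsigned 16-bit integer can hold.
values = list(range(65536))

# Fraction of the text size that the binary form would need.
efficiency = binary_bytes(values) / text_bytes(values)
print(f"text: {text_bytes(values)} bytes, binary: {binary_bytes(values)} bytes")
print(f"efficiency: {efficiency:.1%}")
```

Real data will not be uniformly distributed, so the exact ratio for your files may differ; the point is only that the text form is several times larger than the native one.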

I would consider taking the existing data and, instead of storing it raw, processing it in the following two ways:

1. Store it compressed in a well-known compression format (e.g. gzip or bzip2) onto your permanent archiving media (backup servers, tape drives, whatever), so that you retain the advantages of the .tsv format.
2. Process it into a database which has good storage efficiency. If the files have a fixed and rigorous format (e.g. column X is always a string, column Y is always a 16-bit integer), then you're probably in good shape. Otherwise, a NoSQL database might be better (see Stefan's answer).

This would create an auditable (but perhaps slowly accessible) archive with low risk of data loss, and a quickly-accessible database that doesn't need to be concerned with losing the source data, since you can always re-read it into the database from the archive.
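The archive-plus-database split could look roughly like the sketch below. It is only an illustration under stated assumptions: SQLite is used purely because it ships with Python, not because it is the right engine for tens of TBs, and the file names, the `samples` table and its two columns are all hypothetical.

```python
import csv
import gzip
import shutil
import sqlite3
import tempfile
from pathlib import Path


def archive_tsv(tsv_path, gz_path):
    """Step 1: write a gzip-compressed copy of the .tsv for the archive."""
    with open(tsv_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def load_from_archive(gz_path, conn):
    """Step 2: (re)build the working table straight from the archive."""
    conn.execute("DROP TABLE IF EXISTS samples")
    # Hypothetical fixed schema: column 1 is a string, column 2 an integer.
    conn.execute("CREATE TABLE samples (name TEXT, reading INTEGER)")
    with gzip.open(gz_path, "rt", encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)  # skip the header row
        conn.executemany("INSERT INTO samples VALUES (?, ?)", reader)
    conn.commit()


# Tiny demonstration with made-up data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tsv_path = Path(tmp) / "samples.tsv"
    gz_path = Path(tmp) / "samples.tsv.gz"
    tsv_path.write_text("name\treading\na\t12345\nb\t42\n", encoding="utf-8")

    archive_tsv(tsv_path, gz_path)
    conn = sqlite3.connect(":memory:")
    load_from_archive(gz_path, conn)
    rows = conn.execute(
        "SELECT name, reading FROM samples ORDER BY name"
    ).fetchall()
    conn.close()

print(rows)
```

Because `load_from_archive` drops and recreates the table, the database can be thrown away and rebuilt from the compressed archive at any time, which is the property the two-step approach relies on.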

With the archive compressed and the database stored compactly, you should be able to reduce your total storage space rather than needing twice as much, as you state.

Indexing is going to be the hard part; you'd better have a good idea of what subset of the data you need to be able to query efficiently.
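To make the indexing point concrete, here is a small sketch, again assuming SQLite and a hypothetical `samples` table filled with made-up rows. It indexes only the column that the frequent query filters on and then asks the engine for its query plan to see whether that index is used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (name TEXT, reading INTEGER)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [(f"s{i}", i % 1000) for i in range(10000)],  # made-up rows
)

# Index only the column that the frequent queries filter on.
conn.execute("CREATE INDEX idx_samples_reading ON samples (reading)")

query = "SELECT name FROM samples WHERE reading = ?"

# The last column of each EXPLAIN QUERY PLAN row is a human-readable
# description of the step; it names the index when one is used.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall()
plan_text = " ".join(row[-1] for row in plan)

matches = conn.execute(query, (42,)).fetchall()
print(plan_text)
print(len(matches))
```

Each index costs extra storage and load time, so with tens of TBs it pays to decide up front which few columns really need one.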

answered Sep 23 '22 18:09

Jason S
