Fastest way to retrieve/store millions of small binary objects

I am looking for a fast (as in high performance, not a quick fix) solution for persisting and retrieving tens of millions of small (around 1k) binary objects. Each object should have a unique ID for retrieval (preferably a GUID or SHA). An additional requirement is that it should be usable from .NET and shouldn't require any additional software installation.

Currently, I am using an SQLite database with a single table for this job, but I want to get rid of the overhead of processing simple SQL instructions like SELECT data FROM store WHERE id = id.

I've also tested direct filesystem persistence under NTFS, but the performance degrades very quickly once the store reaches half a million objects.

P.S. By the way, objects never need to be deleted, and the insertion rate is very, very low. In fact, every time an object changes a new version is stored and the previous version remains. This is actually a requirement to support time-traveling.

Just adding some additional information to this thread:

To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem http://arxiv.org/abs/cs.DB/0701168

Hugo Sereno Ferreira asked Jul 18 '09 17:07



2 Answers

You may be able to lessen the performance problems of NTFS by breaking the object's GUID identifier up into pieces and using them as directory names. That way, each directory only contains a limited number of subdirectories or files.

For example, if the identifier is aaaa-bb-cc-ddddeeee, the path to the item would be c:\store\aaaa\bbcc\dddd\eeee.dat. Each path level is four hex digits, which limits every directory to no more than 64k (65,536) subitems.
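A rough sketch of that mapping, in Python for brevity (the same string slicing carries over to .NET). The function name `shard_path`, the default root `c:\store`, and the choice to split the whole identifier into four-digit groups are assumptions made for illustration, not part of the answer above.

```python
import string
from pathlib import PureWindowsPath

GROUP = 4  # hex digits per path level: 16**4 = 65,536 possible names per directory


def shard_path(identifier: str, root: str = r"c:\store") -> PureWindowsPath:
    """Map an identifier such as 'aaaa-bb-cc-ddddeeee' to a nested file path.

    Hypothetical helper: dashes are dropped, the remaining hex digits are cut
    into 4-digit groups, and the last group becomes the file name.
    """
    digits = identifier.replace("-", "").lower()
    if len(digits) < 2 * GROUP or len(digits) % GROUP != 0:
        raise ValueError("expected a multiple of 4 hex digits, at least 8")
    if any(ch not in string.hexdigits for ch in digits):
        raise ValueError("identifier must contain only hex digits and dashes")
    groups = [digits[i:i + GROUP] for i in range(0, len(digits), GROUP)]
    *directories, leaf = groups
    return PureWindowsPath(root, *directories, leaf + ".dat")


if __name__ == "__main__":
    print(shard_path("aaaa-bb-cc-ddddeeee"))
```

A full 32-digit GUID would produce seven directory levels under this scheme. Fewer, wider levels may well perform better, so the depth is worth measuring against your own data.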

Daniel Earwicker answered Oct 05 '22 22:10


You need to call a prepare function only once per statement, with the parameter denoted e.g. by ? (so SELECT data FROM store WHERE id=? is the statement you'd prepare). What you then do "millions of times" is just bind the parameter into the prepared statement and call sqlite3_step (resetting the statement with sqlite3_reset between lookups); these are fast operations. It's also worth benchmarking whether sqlite3_blob_open might be even faster. In other words, I recommend sticking with SQLite and digging into its low-level interface (from managed C++ if you must) for maximum performance. It's really an amazing little engine, and it has often surprised me favorably with its performance!
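To illustrate the prepare-once, bind-many pattern, here is a minimal sketch using Python's standard sqlite3 module rather than the C interface. The table and column names come from the question. The column types, the use of a SHA-1 hex digest as the ID, and the helper names `open_store`, `put` and `get` are assumptions. Python's module keeps a per-connection cache of compiled statements, so reusing the same SQL text with a ? placeholder avoids re-parsing on every lookup. From .NET or C you would hold on to the prepared statement yourself.

```python
import hashlib
import sqlite3

# The same SQL text is reused for every lookup, so the compiled statement
# can be served from the connection's statement cache.
SELECT_SQL = "SELECT data FROM store WHERE id = ?"
INSERT_SQL = "INSERT OR IGNORE INTO store (id, data) VALUES (?, ?)"


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the single-table store. The schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS store (id TEXT PRIMARY KEY, data BLOB NOT NULL)"
    )
    return conn


def put(conn: sqlite3.Connection, data: bytes) -> str:
    """Store a blob under its SHA-1 hex digest and return that ID."""
    object_id = hashlib.sha1(data).hexdigest()
    conn.execute(INSERT_SQL, (object_id, data))  # bind parameters, then step
    return object_id


def get(conn: sqlite3.Connection, object_id: str):
    """Return the blob for object_id, or None if it is not stored."""
    row = conn.execute(SELECT_SQL, (object_id,)).fetchone()
    return None if row is None else row[0]


if __name__ == "__main__":
    conn = open_store()
    oid = put(conn, b"\x00\x01 small binary object")
    conn.commit()
    print(oid, get(conn, oid))
```

Because objects are never deleted and a changed object gets a new ID, INSERT OR IGNORE is enough here: storing the same bytes twice is a no-op.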

Alex Martelli answered Oct 05 '22 22:10