
The best data storage format in terms of file size and performance (e.g., Txt, Asc, Bin, etc.)?

Can someone point me to the best storage format, in terms of read/write speed and weight (file size), for storing large matrices of floating-point numbers with constant precision in a file on an HDD?

I have been using ASCII, text, and binary formats. For the same matrix size (e.g., 10000x10000x200) and the same precision (e.g., 5 significant digits), my impression is that binary gives the best results overall, followed by ASCII and then text, in terms of read/write speed and file size (I haven't done any actual testing).

With that being said, is there a standard data storage format better than binary in my situation? If not, is there any way I can optimize my data structure to get better performance while saving/reading?

PS. I can use C, C++, or Matlab (it doesn't matter which to me) if one of them helps reach better results.

Maiss asked Mar 13 '12 03:03


1 Answer

Binary will be much faster in general. If you are using floats, each number takes 4 bytes, instead of 1 byte per character of its text form - so the number 5.34182 takes 4 bytes instead of 7 bytes plus a delimiter.
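To make the size difference concrete, here is a minimal Python sketch (the same idea carries over to `fwrite`/`fread` in C). The sample values and file names are made up for illustration, and the text layout (one number per line, 5 significant digits) is only an assumption about how the text file is written.

```python
import os
import struct
import tempfile

# Illustrative values only; they are not taken from the question.
values = [5.34182, 12.3456, 0.98765, 123.45]

# Text form: 5 significant digits per number, one byte per character,
# plus a newline as the delimiter.
text_bytes = "".join(f"{v:.5g}\n" for v in values).encode("ascii")

# Binary form: each value stored as a 4-byte little-endian IEEE 754 float.
binary_bytes = struct.pack(f"<{len(values)}f", *values)

with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "matrix.txt")   # hypothetical file names
    bin_path = os.path.join(tmp, "matrix.bin")
    with open(text_path, "wb") as f:
        f.write(text_bytes)
    with open(bin_path, "wb") as f:
        f.write(binary_bytes)

    text_size = os.path.getsize(text_path)
    bin_size = os.path.getsize(bin_path)

    # Read the binary file back to confirm the round trip.
    with open(bin_path, "rb") as f:
        restored = struct.unpack(f"<{len(values)}f", f.read())

print(f"text: {text_size} bytes, binary: {bin_size} bytes")
```

Note that very short numbers (for example `7`) can be smaller as text than as a 4-byte float, so the advantage depends on how many digits you actually write.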

Going further, though, you can probably do better. Your disk doesn't read data byte by byte; it reads data in blocks, and generally you want to avoid reading more blocks than you have to. The real reason that binary format is faster isn't that it takes fewer bytes as such, but that it takes fewer blocks (a consequence of taking fewer bytes). What this means is that you want to minimize the size on disk, because reading from a hard disk is several orders of magnitude slower than reading from RAM - disk seeks are measured in milliseconds, while RAM accesses are measured in nanoseconds.
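As a rough worked example of the block argument, the sketch below counts how many blocks the matrix from the question would occupy in each format. The 4096-byte block size and the 8 bytes per number in text form (7 characters plus a delimiter) are assumptions; real devices, filesystems, and text layouts vary.

```python
BLOCK_SIZE = 4096                 # assumed block size in bytes
ELEMENTS = 10000 * 10000 * 200    # matrix dimensions from the question

def blocks_needed(bytes_per_element: int) -> int:
    """Number of blocks needed to hold every element, rounded up."""
    total_bytes = ELEMENTS * bytes_per_element
    return (total_bytes + BLOCK_SIZE - 1) // BLOCK_SIZE

binary_blocks = blocks_needed(4)  # 4-byte float
text_blocks = blocks_needed(8)    # assumed: 7 characters plus a delimiter

print(f"binary: {ELEMENTS * 4 / 1e9:.0f} GB in {binary_blocks} blocks")
print(f"text:   {ELEMENTS * 8 / 1e9:.0f} GB in {text_blocks} blocks")
```

Under these assumptions the binary file is about 80 GB and the text file about twice that, so a full read touches roughly twice as many blocks.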

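So what can you do? If your matrix is sparse, you can store just the non-zero elements, which will save you a lot of space. Instead of storing every point, store an (index, value) pair for each non-zero entry. With a 4-byte index, each stored entry is now 8 bytes instead of 4, so you come out ahead as soon as more than half of the matrix is zero, and the savings grow the sparser it is. Keep in mind that a 4-byte index can address only about 4.29 billion elements, so a matrix as large as 10000x10000x200 would need a wider index, which shifts the break-even point.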
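A minimal sketch of the (index, value) layout, assuming a flat list of values, an unsigned 32-bit index, and a 32-bit float value. The function names `pack_sparse` and `unpack_sparse` are hypothetical helpers for illustration, not part of any library.

```python
import struct

def pack_sparse(values):
    """Pack non-zero entries as (uint32 index, float32 value) pairs, 8 bytes each."""
    out = bytearray()
    for i, v in enumerate(values):
        if v != 0.0:
            out += struct.pack("<If", i, v)
    return bytes(out)

def unpack_sparse(blob, length):
    """Rebuild the dense list; every index not stored in the blob is zero."""
    dense = [0.0] * length
    for i, v in struct.iter_unpack("<If", blob):
        dense[i] = v
    return dense

# Illustrative data: 2 non-zero entries out of 8.
dense = [0.0, 0.0, 1.5, 0.0, 0.0, 0.0, -2.25, 0.0]
sparse_blob = pack_sparse(dense)
dense_blob = struct.pack(f"<{len(dense)}f", *dense)

print(len(sparse_blob), "bytes sparse vs", len(dense_blob), "bytes dense")
```

You need to store the original length (or the matrix dimensions) somewhere as well, since the pairs alone do not record trailing zeros.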

Finally, compression can help a lot here. More compression means more CPU time to decompress the matrix, but it can also mean faster disk reads. At the simple end of the spectrum, run-length encoding is easy to implement and often works surprisingly well. This works because if you are storing small integers and "simple" floats, most of the bytes are zero. It also works well if the same number is repeated many times, which does happen in matrices. I'd also recommend checking out more advanced schemes, such as bzip2, which, while more computationally expensive, could significantly decrease the size on disk. Alas, compression tends to be very domain specific, so you have to experiment: what works in one domain doesn't always work in another.
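Here is a small sketch for experimenting with both ends of that spectrum: a byte-level run-length encoder written from scratch, next to Python's standard `bz2` module. The (run length, byte) pair format and the sample row are assumptions chosen for illustration; this is one simple RLE variant, not a standard format.

```python
import bz2
import struct

def rle_encode(data: bytes) -> bytes:
    """Byte-level RLE: emit (run length 1..255, byte value) pairs."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and run < 255 and data[i + run] == data[i]:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)

def rle_decode(encoded: bytes) -> bytes:
    """Inverse of rle_encode: expand each (run length, byte value) pair."""
    out = bytearray()
    for j in range(0, len(encoded), 2):
        out += bytes([encoded[j + 1]]) * encoded[j]
    return bytes(out)

# Illustrative row: long runs of zeros with a short run of a repeated value.
row = [0.0] * 500 + [1.0] * 20 + [0.0] * 480
raw = struct.pack(f"<{len(row)}f", *row)

rle = rle_encode(raw)
bz = bz2.compress(raw)

print(f"raw: {len(raw)} bytes, RLE: {len(rle)} bytes, bzip2: {len(bz)} bytes")
```

On data without long runs (for example, floats with noisy low-order bits), this RLE variant can double the size, since every lone byte costs two bytes. That is exactly why it is worth measuring on your own matrices.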

mindvirus answered Oct 13 '22 01:10