 

How is HDF5 different from a folder with files?

I'm working on an open source project dealing with adding metadata to folders. The provided (Python) API lets you browse and access metadata as if it were just another folder. Because it is just another folder.

\folder\.meta\folder\somedata.json 

Then I came across HDF5 and its derivative, Alembic.

Reading up on HDF5 in the book Python and HDF5, I looked for benefits to using it over files in folders, but most of what I came across spoke about the benefits of a hierarchical file format in terms of how simply data can be added through its API:

>>> import h5py
>>> f = h5py.File("weather.hdf5")
>>> f["/15/temperature"] = 21

Or its ability to read only certain parts of a file on request (i.e. random access), and parallel I/O on a single HDF5 file (e.g. for multiprocessing).
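For example, reading just a slice of a dataset looks roughly like this (my own sketch; the file name and path follow the book's example, the data is made up):

import h5py
import numpy as np

with h5py.File("weather.hdf5", "w") as f:
    f["/15/temperature"] = np.random.rand(1_000_000)

with h5py.File("weather.hdf5", "r") as f:
    dset = f["/15/temperature"]  # just a handle, nothing read yet
    window = dset[:100]          # reads only these 100 values from disk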

You can even mount HDF5 files: https://github.com/zjttoefs/hdfuse5

It even boasts a strong yet simple foundational concept of Groups and Datasets, which the wiki describes as:

  • Datasets, which are multidimensional arrays of a homogeneous type
  • Groups, which are container structures which can hold datasets and other groups

Replace Dataset with File and Group with Folder and the whole feature-set sounds to me like what files in folders are already fully capable of doing.
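To spell the analogy out, here is the same hierarchy built both ways (a rough sketch of mine; the names "archive", "measurements" and "run1" are made up):

from pathlib import Path
import numpy as np
import h5py

data = np.arange(10)

# folder-with-files version
root = Path("archive")
(root / "measurements").mkdir(parents=True, exist_ok=True)
np.save(root / "measurements" / "run1.npy", data)

# HDF5 version of the same hierarchy: Group ~ folder, Dataset ~ file
with h5py.File("archive.hdf5", "w") as f:
    f.create_group("measurements").create_dataset("run1", data=data)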

For every benefit I came across, not one stood out as being exclusive to HDF5.

So my question is, if I were to give you one HDF5 file and one folder with files, both with identical content, in which scenario would HDF5 be better suited?

Edit:

I've gotten some responses about the portability of HDF5.

It sounds lovely and all, but I still haven't been given an example, a scenario, in which an HDF5 file would outdo a folder with files. Why would someone consider using HDF5 when a folder is readable on any computer, on any file system, over a network, supports "parallel I/O", and is readable by humans without an HDF5 interpreter?

I would go as far as to say, a folder with files is far more portable than any HDF5.

Edit 2:

Thucydides411 just gave an example of a scenario where portability matters. https://stackoverflow.com/a/28512028/478949

I think what I'm taking away from the answers in this thread is that HDF5 is well suited for when you need the organisational structure of files and folders, as in the example scenario above, with lots of (millions of) small (~1 byte) data structures, like individual numbers or strings. It makes up for what file systems lack by providing a "sub file-system" favouring the small-and-many as opposed to the few-and-large.

In computer graphics, we use it to store geometric models and arbitrary data about individual vertices, which seems to align quite well with its use in the scientific community.
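As a hypothetical sketch of that use (all names invented), per-vertex data sits alongside the geometry like this:

import numpy as np
import h5py

n_verts = 1_000_000
positions = np.random.rand(n_verts, 3).astype("f4")  # xyz per vertex
weights = np.random.rand(n_verts).astype("f4")       # arbitrary per-vertex data

with h5py.File("mesh.hdf5", "w") as f:
    geo = f.create_group("geometry")
    geo.create_dataset("P", data=positions)
    geo.create_dataset("weight", data=weights)
    geo["P"].attrs["interpretation"] = "point"        # metadata rides along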

Asked Mar 02 '14 by Marcus Ottosson



1 Answer

As someone who developed a scientific project that went from using folders of files to HDF5, I think I can shed some light on the advantages of HDF5.

When I began my project, I was operating on small test datasets and producing small amounts of output, in the range of kilobytes. I began with the easiest data format: tables encoded as ASCII. For each object I processed, I produced one ASCII table.

I began applying my code to groups of objects, which meant writing multiple ASCII tables at the end of each run, along with an additional ASCII table containing output related to the entire group. For each group, I now had a folder that looked like:

+ group
|   |-- object 1
|   |-- object 2
|   |-- ...
|   |-- object N
|   |-- summary

At this point, I began running into my first difficulties. ASCII files are very slow to read and write, and they don't pack numeric information very efficiently, because each digit takes a full byte to encode rather than ~3.3 bits. So I switched over to writing each object as a custom binary file, which sped up I/O and decreased file size.
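As a back-of-the-envelope illustration of that point (file names made up), here are the same doubles written as text versus raw binary:

import numpy as np

values = np.random.rand(100_000)   # 64-bit floats, 8 bytes each in memory

np.savetxt("table.txt", values)    # ASCII: roughly 25 bytes per value
values.tofile("table.bin")         # raw binary: exactly 8 bytes per value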

As I scaled up to processing large numbers (tens of thousands to millions) of groups, I suddenly found myself dealing with an extremely large number of files and folders. Having too many small files can be a problem for many filesystems (many filesystems are limited in the number of files they can store, regardless of how much disk space there is). I also began to find that when I would try to do post-processing on my entire dataset, the disk I/O to read many small files was starting to take up an appreciable amount of time. I tried to solve these problems by consolidating my files, so that I only produced two files for each group:

+ group 1
|   |-- objects
|   |-- summary
+ group 2
|   |-- objects
|   |-- summary
...

I also wanted to compress my data, so I began creating .tar.gz files for collections of groups.

At this point, my whole data scheme was getting very cumbersome, and there was a risk that if I wanted to hand my data to someone else, it would take a lot of effort to explain to them how to use it. The binary files that contained the objects, for example, had their own internal structure that existed only in a README file in a repository and on a pad of paper in my office. Whoever wanted to read one of my combined object binary files would have to know the byte offset, type and endianness of each metadata entry in the header, and the byte offset of every object in the file. If they didn't, the file would be gibberish to them.
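To give a flavour of what that meant in practice, here's an illustrative, made-up example of reading such a header; nothing about the layout is discoverable from the file itself:

import struct

# write a header the way my old code did (values invented)
with open("group.bin", "wb") as fh:
    fh.write(struct.pack("<i4xd", 42, 1234.5))

# to read it back you must already know: little-endian, int32 object count
# at offset 0, 4 bytes of padding, float64 timestamp at offset 8
with open("group.bin", "rb") as fh:
    n_objects, timestamp = struct.unpack("<i4xd", fh.read(16))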

The way I was grouping and compressing data also posed problems. Let's say I wanted to find one object. I would have to locate the .tar.gz file it was in, unzip the entire contents of the archive to a temporary folder, navigate to the group I was interested in, and retrieve the object with my own custom API to read my binary files. After I was done, I would delete the temporarily unzipped files. It was not an elegant solution.

At this point, I decided to switch to a standard format. HDF5 was attractive for a number of reasons. Firstly, I could keep the overall organization of my data into groups, object datasets and summary datasets. Secondly, I could ditch my custom binary file I/O API, and just use a multidimensional array dataset to store all the objects in a group. I could even create arrays of more complicated datatypes, like arrays of C structs, without having to meticulously document the byte offsets of every entry.

Next, HDF5 has chunked compression, which can be completely transparent to the end user of the data. Because the compression is chunked, if I think users are going to want to look at individual objects, I can have each object compressed in a separate chunk, so that only the part of the dataset the user is interested in needs to be decompressed. Chunked compression is an extremely powerful feature.
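Here is a sketch of those features in h5py (field names and sizes invented for illustration): a compound, C-struct-like dtype, one object per chunk, and transparent gzip compression:

import numpy as np
import h5py

obj_dtype = np.dtype([("id", "i8"), ("mass", "f8"), ("position", "f8", (3,))])
objects = np.zeros(10_000, dtype=obj_dtype)

with h5py.File("group1.hdf5", "w") as f:
    f.create_dataset("objects", data=objects,
                     chunks=(1,),            # each object in its own chunk
                     compression="gzip")     # decompressed only when read
    f.create_dataset("summary", data=np.zeros(8))

# reading one object later only decompresses that object's chunk
with h5py.File("group1.hdf5", "r") as f:
    one = f["objects"][1234]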

Finally, I can just give a single file to someone now, without having to explain much about how it's internally organized. The end user can read the file from Python, C or Fortran, with h5ls on the command line, or with the HDFView GUI, and see what's inside. That wasn't possible with my custom binary format, not to mention my .tar.gz collections.

Sure, it's possible to replicate everything you can do with HDF5 with folders, ASCII and custom binary files. That's what I originally did, but it became a major headache, and in the end, HDF5 did everything I was kluging together in an efficient and portable way.

Answered Sep 20 '22 by apdnu