 

h5py error reading virtual dataset into NumPy array

Tags: python, hdf5, h5py

I'm trying to load data from a virtual HDF5 dataset created with h5py, and I'm having trouble loading the data properly.

Here is an example of my issue:

import h5py
import tools as ut  # helper module that holds the file paths

# open the file containing the virtual dataset
virtual = h5py.File(ut.params.paths.virtual, 'r')

# copy the whole dataset into a NumPy array
a = virtual['part2/index'][:]

# last element read directly from the dataset vs. from the array copy
print(virtual['part2/index'][-1])
print(a[-1])

This outputs:

[890176134]
[0]

Why? Why is the last element different when I copy the data into a NumPy array (value=[0]) vs when I read directly from the dataset (value=[890176134])?

Am I doing something horribly wrong without realizing it?

Thanks a lot.

asked Feb 10 '26 by pnjun


2 Answers

Yes, you should get the same values whether you read directly from the virtual dataset or from an array created from it. It's hard to diagnose the error without more details about your data.

I used the h5py example vds_simple.py to demonstrate how this should behave. Most of the code builds the HDF5 files; the section at the end compares the output. The code below is modified from the example to create a variable number of source files (set by the variable a0).

Code to create the 'a0' source files with sample data:

import h5py
import numpy as np

a0 = 5000
# create sample data
data = np.arange(0, 100).reshape(1, 100)

# Create source files 0.h5 through (a0-1).h5
for n in range(a0):
    with h5py.File(f"{n}.h5", "w") as f:
        row_data = data + n
        f.create_dataset("data", data=row_data)

Code to define the virtual layout and assemble virtual dataset:

# Assemble virtual dataset
layout = h5py.VirtualLayout(shape=(a0, 100), dtype="i4")
for n in range(a0):
    filename = f"{n}.h5"
    vsource = h5py.VirtualSource(filename, "data", shape=(100,))
    layout[n] = vsource

# Add virtual dataset to output file
with h5py.File("VDS.h5", "w", libver="latest") as f:
    f.create_virtual_dataset("vdata", layout)
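
As a side note, the vds_simple.py example in the h5py docs also passes a fillvalue to create_virtual_dataset; unmapped or unreadable source data then shows up as that value instead of zeros, which makes problems like the one in the question easier to spot. A minimal variant of the block above, reusing the same layout (the -1 is an arbitrary choice):

# Variant: an explicit fill value makes unmapped source data show up
# as -1 instead of silently appearing as 0.
with h5py.File("VDS.h5", "w", libver="latest") as f:
    f.create_virtual_dataset("vdata", layout, fillvalue=-1)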

Code to read and print the data:

# read data back
# virtual dataset is transparent for reader!
with h5py.File("VDS.h5", "r") as f:
    arr = f["vdata"][:]

    print("\nFirst 10 Elements in First Row:")
    print("Virtual dataset:")
    print(f["vdata"][0, :10])
    print("Reading vdata into Array:")
    print(arr[0, :10])

    print("Last 10 Elements of Last Row:")
    print("Virtual dataset:")
    print(f["vdata"][-1,-10:])
    print("Reading vdata into Array:")
    print(arr[-1,-10:])    

Output from code above (w/ a0=5000):

First 10 Elements in First Row:
Virtual dataset:
[0 1 2 3 4 5 6 7 8 9]
Reading vdata into Array:
[0 1 2 3 4 5 6 7 8 9]
Last 10 Elements of Last Row:
Virtual dataset:
[5089 5090 5091 5092 5093 5094 5095 5096 5097 5098]
Reading vdata into Array:
[5089 5090 5091 5092 5093 5094 5095 5096 5097 5098]
answered Feb 13 '26 by kcw78


The accepted answer confirms that this should work, but the OP observes that it doesn't; hence there is a bug worth reporting.

In this answer, I want to provide more information on this issue and constructive approaches to fix it, including code.


Description of the issue:

A "virtual" HDF5 dataset is composed of several smaller HDF5 datasets. When loading the virtual dataset, it can be sometimes observed that the "last entries" being loaded are actually empty (i.e. they contain the default value, e.g. zeros, empty strings...). Crucially, this always happens towards the end of the loaded files.

But, when we check the individual HDF5 files separately, we do observe that they are actually not empty.
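
A quick sanity check is to open one of the affected sub-files directly, bypassing the virtual layer, and compare it with what the virtual dataset returns. The file and dataset names below are hypothetical placeholders:

import h5py

# hypothetical name: one of the source files backing the tail of the virtual dataset
with h5py.File("source_0999.h5", "r") as f:
    print(f["index"][-1])          # non-empty when read directly

# the same region read through the virtual dataset
with h5py.File("virtual.h5", "r") as f:
    print(f["part2/index"][-1])    # comes back as the fill value, e.g. 0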

A first hypothesis is that there is some race condition, or that some file is not being properly flushed. But flushing everything, adding delays or futures, or setting the environment variable HDF5_USE_FILE_LOCKING to FALSE, as has been suggested elsewhere, did not help.
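
For completeness, that file-locking workaround is usually applied by setting the environment variable before h5py (and therefore libhdf5) is first imported, e.g.:

import os

# must be set before h5py / libhdf5 is initialised, otherwise it has no effect
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"

import h5py  # imported only after the environment variable is in place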


Explanation of the issue:

Luckily, the issue is clearly explained by Thomas Kluyver in this forum entry: https://forum.hdfgroup.org/t/virtual-dataset-in-read-write-file-missing-data-from-read-only-file/5647

  1. Operating systems limit the maximum number of files a single process can have open at the same time. On Linux, the soft limit can be checked via ulimit -n (and the hard limit via ulimit -Hn), and is typically something like 1024 (see the sketch after this list for checking it from Python).
  2. When we open the virtual HDF5 dataset, the process keeps opening the sub-files, and each sub-file counts as one towards the limit.
  3. When the limit is exceeded, this is silently ignored by HDF5, and the corresponding entries in the virtual dataset are filled with default values (i.e. "empty data"). We will thus observe that, after some point, our virtual dataset has empty values, even though the sub-files are not empty.
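
For reference, these limits can also be inspected, and the soft limit raised up to the hard limit, from within Python via the standard resource module (Unix only); a minimal sketch:

import resource

# current per-process limits on open file descriptors
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")

# raise the soft limit as far as the hard limit allows; going beyond the
# hard limit requires admin rights (e.g. ulimit / limits.conf)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))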

Solution/Workaround:

It seems that, if we plan to aggregate very large numbers of files, this issue will persist, and it cannot be fixed from the library or programming-language side, since it stems from the OS. Users with admin rights may be able to raise the number of allowed open files, but this is typically not allowed on the computational clusters where HDF5 is most needed, and setting this number to a larger constant is IMO a ticking bomb anyway.

So the solution seems to be getting rid of the virtual structure and aggregating all sub-files into a single, main HDF5 database.

The gist below contains a static class that documents and helps to perform all the required steps:

https://gist.github.com/andres-fr/00a73aa2cd6ef5cf609a0446ec0c5d91
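
As a rough illustration of the idea (not the gist's actual code), a minimal sketch that copies the dataset from each source file into one consolidated, non-virtual HDF5 file could look like this, reusing the 0.h5 ... (a0-1).h5 naming and shapes from the accepted answer:

import h5py

a0 = 5000  # number of source files, as in the accepted answer

with h5py.File("consolidated.h5", "w") as out:
    dset = out.create_dataset("data", shape=(a0, 100), dtype="i4")
    for n in range(a0):
        # each source file is opened, copied and closed again, so the
        # number of simultaneously open files stays constant
        with h5py.File(f"{n}.h5", "r") as src:
            dset[n] = src["data"][...].reshape(-1)  # flatten (1, 100) -> (100,)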


Discussion:

Note that the virtual structure is very convenient, since we may want several concurrent processes to write to the database at the same time, and this is generally not possible, or at least discouraged, with a single centralized file.

But once written, if the virtual structure comprises >>100 files, converting it to a centralized file seems to be the only way to circumvent the OS restrictions from the lib/language side.

I'd love to be wrong, so if anybody has better ideas, please do share! Cheers
Andres

answered Feb 13 '26 by fr_andres


