I have a 100M-line CSV file (actually many separate CSV files) totaling 84 GB. I need to convert it to an HDF5 file with a single float dataset. I used h5py in testing without any problems, but now I can't write the final dataset without running out of memory.
How can I write to HDF5 without having to store the whole dataset in memory? I'm expecting actual code here, because it should be quite simple.
I was just looking into pytables, but it doesn't look like the array class (which corresponds to an HDF5 dataset) can be written to iteratively. Similarly, pandas has read_csv and to_hdf methods in its io_tools, but I can't load the whole dataset at one time so that won't work. Perhaps you can help me solve the problem correctly with other tools in pytables or pandas.
Use append=True in the call to to_hdf:
import numpy as np
import pandas as pd

filename = '/tmp/test.h5'

df = pd.DataFrame(np.arange(10).reshape((5,2)), columns=['A', 'B'])
print(df)
#    A  B
# 0  0  1
# 1  2  3
# 2  4  5
# 3  6  7
# 4  8  9

# Save to HDF5
df.to_hdf(filename, 'data', mode='w', format='table')
del df    # allow df to be garbage collected

# Append more data
df2 = pd.DataFrame(np.arange(10).reshape((5,2))*10, columns=['A', 'B'])
df2.to_hdf(filename, 'data', append=True)

print(pd.read_hdf(filename, 'data'))
yields
    A   B
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9
0   0  10
1  20  30
2  40  50
3  60  70
4  80  90
Note that you need to use format='table' in the first call to df.to_hdf to make the table appendable. Otherwise the format is 'fixed' by default, which is faster for reading and writing but creates a table that cannot be appended to.
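As a minimal illustration of the difference (the file names here are arbitrary):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape((5, 2)), columns=['A', 'B'])

# 'fixed' format (the default): fast, but a later call with append=True
# would fail, since a fixed store cannot be appended to
df.to_hdf('/tmp/fixed.h5', 'data', mode='w')

# 'table' format: slightly slower, but the dataset can be extended later
df.to_hdf('/tmp/table.h5', 'data', mode='w', format='table')
df.to_hdf('/tmp/table.h5', 'data', append=True)   # adds five more rows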
Thus, you can process each CSV file one at a time, using append=True to build up the HDF5 file. Then overwrite the DataFrame, or use del df, to allow the old DataFrame to be garbage collected.
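A minimal sketch of that per-file loop (the paths and dtype below are placeholders, not taken from the question):

import glob
import pandas as pd

hdf_path = '/tmp/combined.h5'                    # hypothetical output file
csv_files = sorted(glob.glob('data/*.csv'))      # hypothetical input files

for i, csv_file in enumerate(csv_files):
    df = pd.read_csv(csv_file, dtype=float)      # assumes each file fits in memory on its own
    df.to_hdf(hdf_path, 'data',
              mode='w' if i == 0 else 'a',       # create the file on the first pass only
              format='table',                    # table format so later calls can append
              append=(i > 0))
    del df                                       # let the old DataFrame be garbage collected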
Alternatively, instead of calling df.to_hdf, you could append to an HDFStore:
import numpy as np
import pandas as pd

filename = '/tmp/test.h5'

store = pd.HDFStore(filename)
for i in range(2):
    df = pd.DataFrame(np.arange(10).reshape((5,2)) * 10**i, columns=['A', 'B'])
    store.append('data', df)
store.close()

store = pd.HDFStore(filename)
data = store['data']
print(data)
store.close()
yields
    A   B
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9
0   0  10
1  20  30
2  40  50
3  60  70
4  80  90
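Applied to the original setup of many large CSV files, the HDFStore approach can also be combined with pandas' chunked CSV reader, so that no single file has to fit in memory. This is just a sketch, with paths, chunk size, and dtype as assumed placeholders:

import glob
import pandas as pd

store = pd.HDFStore('/tmp/combined.h5', mode='w')    # hypothetical output file

for csv_file in sorted(glob.glob('data/*.csv')):     # hypothetical input files
    # read each CSV in pieces and append every piece to the same table
    for chunk in pd.read_csv(csv_file, chunksize=1_000_000, dtype=float):
        store.append('data', chunk)

store.close()

Keeping the store open for the whole run avoids reopening the HDF5 file for every append.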