Convert large csv to hdf5

I have a CSV file with 100 million rows (actually many separate CSV files) totaling 84 GB. I need to convert it to an HDF5 file with a single float dataset. I used h5py in testing without any problems, but now I can't create the final dataset without running out of memory.

How can I write to HDF5 without having to store the whole dataset in memory? I'm expecting actual code here, because it should be quite simple.

I was just looking into PyTables, but it doesn't look like the array class (which corresponds to an HDF5 dataset) can be written to iteratively. Similarly, pandas has read_csv and to_hdf methods in its io_tools, but I can't load the whole dataset at one time, so that won't work. Perhaps you can help me solve the problem correctly with other tools in PyTables or pandas.

asked Nov 29 '14 by jmilloy

People also ask

How do I convert a CSV file to HDF5?

If you have a very large single CSV file, you may want to stream the conversion to HDF5 by reading the file in chunks with pandas' read_csv (the quoted snippet uses a chunk size of 5,000,000 rows and explicit dtypes such as {'latitude': float, 'longitude': float}) and appending each chunk to the HDF5 file as it is read; see the sketch after this section.

Why is HDF5 file so large?

This is probably due to your chunk layout: the smaller the chunks, the more your HDF5 file will be bloated. Try to find an optimal balance between chunk sizes that serve your use case and the size overhead they introduce in the HDF5 file.

When should I use HDF5?

Supports large, complex data: HDF5 is a compressed format designed to support large, heterogeneous, and complex datasets. Supports data slicing: "data slicing", or extracting portions of the dataset as needed for analysis, means large files don't need to be read completely into the computer's memory (RAM).
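
A minimal sketch of the streaming approach described in the first answer above. The chunk size and the latitude/longitude dtypes come from the quoted snippet; the file names 'data.csv' and 'data.h5' are assumptions:

import pandas as pd

CHUNK_SIZE = 5000000  # rows per chunk, from the quoted snippet
dtypes = {'latitude': float, 'longitude': float}

first = True
# Stream the CSV in chunks so the whole file never has to fit in memory
for chunk in pd.read_csv('data.csv', dtype=dtypes, chunksize=CHUNK_SIZE):
    # The first chunk creates the table; later chunks append to it
    chunk.to_hdf('data.h5', 'data', format='table',
                 mode='w' if first else 'a', append=not first)
    first = False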


1 Answer

Use append=True in the call to to_hdf:

import numpy as np
import pandas as pd

filename = '/tmp/test.h5'

df = pd.DataFrame(np.arange(10).reshape((5,2)), columns=['A', 'B'])
print(df)
#    A  B
# 0  0  1
# 1  2  3
# 2  4  5
# 3  6  7
# 4  8  9

# Save to HDF5
df.to_hdf(filename, 'data', mode='w', format='table')
del df    # allow df to be garbage collected

# Append more data
df2 = pd.DataFrame(np.arange(10).reshape((5,2))*10, columns=['A', 'B'])
df2.to_hdf(filename, 'data', append=True)

print(pd.read_hdf(filename, 'data'))

yields

    A   B
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9
0   0  10
1  20  30
2  40  50
3  60  70
4  80  90

Note that you need to use format='table' in the first call to df.to_hdf to make the table appendable. Otherwise, the format is 'fixed' by default, which is faster for reading and writing but creates a table that cannot be appended to.

Thus, you can process the CSV files one at a time, using append=True to build up the HDF5 file. Then overwrite the DataFrame (or use del df) so the old DataFrame can be garbage collected.
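
Putting that together for the many-CSV case in the question, a minimal sketch might look like this. The glob pattern is a hypothetical location, and it assumes each individual CSV fits in memory on its own:

import glob

import pandas as pd

filename = '/tmp/test.h5'
first = True
for path in sorted(glob.glob('/path/to/csvs/*.csv')):  # hypothetical location
    df = pd.read_csv(path)
    # The first file creates the table; every later file appends to it
    df.to_hdf(filename, 'data', format='table',
              mode='w' if first else 'a', append=not first)
    del df       # let the old DataFrame be garbage collected
    first = False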


Alternatively, instead of calling df.to_hdf, you could append to an HDFStore:

import numpy as np
import pandas as pd

filename = '/tmp/test.h5'
store = pd.HDFStore(filename)

for i in range(2):
    df = pd.DataFrame(np.arange(10).reshape((5,2)) * 10**i, columns=['A', 'B'])
    store.append('data', df)

store.close()

store = pd.HDFStore(filename)
data = store['data']
print(data)
store.close()

yields

    A   B
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9
0   0  10
1  20  30
2  40  50
3  60  70
4  80  90
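
For the use case in the question, the HDFStore version combines naturally with chunked reading, so neither a whole CSV nor the combined dataset ever has to fit in memory. A minimal sketch, where the glob pattern and chunk size are assumptions:

import glob

import pandas as pd

store = pd.HDFStore('/tmp/test.h5', mode='w')
for path in sorted(glob.glob('/path/to/csvs/*.csv')):  # hypothetical location
    # Read each CSV in chunks; every chunk is appended to the same
    # appendable 'data' table in the store
    for chunk in pd.read_csv(path, chunksize=1000000):
        store.append('data', chunk)
store.close()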
answered Sep 18 '22 by unutbu