
How to write a large CSV file to HDF5 in Python?

I have a dataset that is too large to read directly into memory, and I don't want to upgrade the machine. From my reading, HDF5 may be a suitable solution for my problem. But I am not sure how to iteratively write a DataFrame into the HDF5 file, since I cannot load the CSV file as a DataFrame object.

So my question is: how do I write a large CSV file into an HDF5 file with Python pandas?

asked Oct 07 '17 by Yan Song


People also ask

How do I convert a CSV file to HDF5?

If you have a very large single CSV file, you may want to stream the conversion to HDF, e.g.:

import numpy as np
import pandas as pd
from IPython.display import clear_output

CHUNK_SIZE = 5000000
filename = 'data.csv'
dtypes = {'latitude': float, 'longitude': float}
iter_csv = pd.read_csv(filename, iterator=True, chunksize=CHUNK_SIZE, dtype=dtypes)
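A minimal way to finish that stream, assuming we are free to pick the output file name ('data.h5') and table key ('data'), is to append each chunk to an HDFStore:

with pd.HDFStore('data.h5') as store:
    for i, chunk in enumerate(iter_csv):
        # append each chunk to one growing, appendable table
        store.append('data', chunk, format='table', data_columns=True)
        clear_output(wait=True)
        print('processed chunk', i)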

How do I import a large CSV file into Python?

read_csv(chunksize): One way to process a large file is to read its entries in chunks of reasonable size, each of which is read into memory and processed before the next chunk is read. The chunksize parameter specifies the size of each chunk as a number of rows.
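As a quick sketch (the file name and the per-chunk work are made up for illustration), each chunk arrives as an ordinary DataFrame:

import pandas as pd

total_rows = 0
for chunk in pd.read_csv('data.csv', chunksize=100000):
    # each chunk is a regular DataFrame; process it before the next one is read
    total_rows += len(chunk)
print(total_rows)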

Is HDF5 faster than csv?

With categorical features stored as strings, an interesting observation is that HDF shows an even slower loading speed than CSV, while other binary formats perform noticeably better.

Why is HDF5 file so large?

This is probably due to your chunk layout: the smaller the chunks, the more your HDF5 file will be bloated. Try to find an optimal balance between chunk sizes (to serve your use case properly) and the overhead (size-wise) that they introduce in the HDF5 file.
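For example, with h5py (the file and dataset names here are hypothetical), the chunks argument sets the chunk layout explicitly, and very small chunks inflate the file with per-chunk overhead:

import h5py
import numpy as np

data = np.random.rand(1_000_000)
with h5py.File('example.h5', 'w') as f:
    # tiny chunks: lots of per-chunk metadata and padding, bloated file
    f.create_dataset('small_chunks', data=data, chunks=(16,))
    # larger chunks: the same data with far less overhead
    f.create_dataset('large_chunks', data=data, chunks=(65536,))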


1 Answer

You can read the CSV file in chunks using the chunksize parameter and append each chunk to the HDF file:

import pandas as pd

# hdf_filename / csv_filename: paths to the output HDF5 and input CSV files
hdf_key = 'hdf_key'
df_cols_to_index = [...]  # list of columns (labels) that should be indexed
store = pd.HDFStore(hdf_filename)

for chunk in pd.read_csv(csv_filename, chunksize=500000):
    # don't index data columns in each iteration - we'll do it later ...
    store.append(hdf_key, chunk, data_columns=df_cols_to_index, index=False)

# index data columns in HDFStore
store.create_table_index(hdf_key, columns=df_cols_to_index, optlevel=9, kind='full')
store.close()
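One payoff of indexing the data columns is that the store can then be queried without reading everything back into memory; 'some_col' below is a hypothetical member of df_cols_to_index:

# read back only the rows matching a condition on an indexed data column
df_subset = pd.read_hdf(hdf_filename, hdf_key, where='some_col > 0')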
answered Nov 14 '22 by MaxU - stop WAR against UA