import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.zeros((1000000,1)))
df.to_csv('test.csv')
df.to_hdf('test.h5', 'df')
ls -sh test*
11M test.csv 16M test.h5
If I use an even larger dataset, the effect is even bigger. Using an HDFStore as below changes nothing.
store = pd.HDFStore('test.h5')
store.put('df', df, format='table')  # store the DataFrame itself, not the raw ndarray
store.close()
Edit: Never mind. The example is bad! Using some non-trivial numbers instead of zeros changes the story.
from numpy.random import rand
import pandas as pd
df = pd.DataFrame(data=rand(10000000,1))
df.to_csv('test.csv')
df.to_hdf('test.h5', 'df')
ls -sh test*
260M test.csv 153M test.h5
Expressing numbers as floats should take fewer bytes than expressing them as strings of characters, one character per digit. This is generally true, except in my first example, in which every value was 0.0. The string '0.0' needs only three characters, while a float64 always needs eight bytes, so in that case the text representation was actually smaller than the binary one.
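You can check the byte counts yourself. A minimal sketch (the long literal is just an arbitrary non-trivial float, not taken from the data above):

import numpy as np

print(np.float64(0.0).nbytes)          # 8 bytes for any float64 stored in binary
print(len('0.0'))                      # 3 bytes for the same value written as text
print(len(repr(0.5488135039273248)))   # 18 bytes for a non-trivial float as text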
HDF5 stores data in a binary format that is native to the machine yet portable across platforms. This binary layout makes it more efficient for computers to read and write than text formats (e.g., .txt or .csv), which are meant for humans to read.
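To make the binary-versus-text contrast concrete, here is a small illustration of my own using Python's struct module (not from the quoted answer):

import struct

x = 1 / 3
binary = struct.pack('<d', x)   # machine-native IEEE 754 double: always 8 bytes
text = repr(x).encode()         # human-readable decimal digits: 18 bytes here
print(len(binary), len(text))   # 8 18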
Averaged I/O times for each data format (summarizing a benchmark plot) reveal an interesting observation: hdf shows an even slower loading speed than csv, while the other binary formats perform noticeably better. The two most impressive are feather and parquet.
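For reference, a rough sketch to reproduce such a benchmark yourself (it assumes pytables is installed for hdf and pyarrow for feather/parquet; absolute times will vary by machine):

import time
import pandas as pd
from numpy.random import rand

df = pd.DataFrame(data=rand(10000000, 1), columns=['x'])
df.to_csv('test.csv')
df.to_hdf('test.h5', 'df')
df.to_feather('test.feather')
df.to_parquet('test.parquet')

def timed(reader, path):
    t0 = time.perf_counter()
    reader(path)
    return time.perf_counter() - t0

print('csv    ', timed(pd.read_csv, 'test.csv'))
print('hdf    ', timed(pd.read_hdf, 'test.h5'))
print('feather', timed(pd.read_feather, 'test.feather'))
print('parquet', timed(pd.read_parquet, 'test.parquet'))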
This is probably due to your chunk layout: the smaller the chunks, the more your HDF5 file will be bloated. Try to find a balance between chunk sizes small enough to serve your use case and the size overhead they introduce in the HDF5 file.
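pandas does not expose the HDF5 chunk shape directly, so here is a sketch with h5py (file names and chunk sizes are arbitrary) showing that chunk size alone changes the file size:

import h5py
import numpy as np

data = np.random.rand(1000000)

with h5py.File('small_chunks.h5', 'w') as f:
    f.create_dataset('data', data=data, chunks=(16,))      # many tiny chunks: lots of index/metadata overhead

with h5py.File('large_chunks.h5', 'w') as f:
    f.create_dataset('data', data=data, chunks=(65536,))   # few large chunks: little overhead

# compare sizes with: ls -sh small_chunks.h5 large_chunks.h5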
Using pandas, one way to process large files is to read the entries in chunks of reasonable size: each chunk is read into memory and processed before the next chunk is read. We can use the chunksize parameter to specify the size of a chunk, i.e., the number of lines.
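For example, summing a column of the test.csv written above without ever loading the whole file into memory (the column name '0' comes from the DataFrame's default integer column label):

import pandas as pd

total = 0.0
for chunk in pd.read_csv('test.csv', chunksize=100000):   # 100,000 rows at a time
    total += chunk['0'].sum()                             # process, then discard the chunk
print(total)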
Briefly:
csv files are 'dumb': they are written one character at a time, so if you print a (say, four-byte) float to ten digits you really use ten bytes -- but the good news is that csv compresses well, so consider .csv.gz.
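With pandas this is a one-liner, since compression is inferred from the file extension (a sketch; the exact savings depend on the data):

import pandas as pd
from numpy.random import rand

df = pd.DataFrame(data=rand(10000000, 1))
df.to_csv('test.csv')      # plain text
df.to_csv('test.csv.gz')   # gzip-compressed, inferred from the .gz suffix
# compare sizes with: ls -sh test.csv test.csv.gz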
hdf5 is a meta-format, and the No Free Lunch theorem still holds: the entries and values need to be stored somewhere, which may make hdf5 larger.
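That overhead can often be offset by HDF5's built-in compression, which pandas exposes through to_hdf (the complevel/complib values here are just one reasonable choice):

import pandas as pd
from numpy.random import rand

df = pd.DataFrame(data=rand(10000000, 1))
df.to_hdf('test.h5', 'df')                                       # uncompressed
df.to_hdf('test_blosc.h5', 'df', complevel=9, complib='blosc')   # compressed
# compare sizes with: ls -sh test.h5 test_blosc.h5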
But you are overlooking a larger issue: csv is just text, which has limited precision -- whereas hdf5 is one of several binary (serialization) formats that store data at higher precision. It really is apples to oranges in that regard too.
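A round-trip sketch of the precision point (the %.6f format is an arbitrary choice someone might use to shrink a csv; by default pandas writes full precision):

import pandas as pd

df = pd.DataFrame({'x': [1 / 3]})
df.to_csv('prec.csv', float_format='%.6f')       # text truncated to 6 digits
print(pd.read_csv('prec.csv')['x'][0] - 1 / 3)   # nonzero: precision was lost

df.to_hdf('prec.h5', 'df')                       # binary round-trips exactly
print(pd.read_hdf('prec.h5')['x'][0] - 1 / 3)    # 0.0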