Why are CSV files smaller than HDF5 files when writing with Pandas?

import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.zeros((1000000,1)))
df.to_csv('test.csv')
df.to_hdf('test.h5', 'df')

ls -sh test*
11M test.csv  16M test.h5

If I use an even larger dataset then the effect is even bigger. Using an HDFStore like below changes nothing.

store = pd.HDFStore('test.h5')
store.put('df', df, format='table')
store.close()

Edit: Never mind. The example is bad! Using some non-trivial numbers instead of zeros changes the story.

from numpy.random import rand
import pandas as pd

df = pd.DataFrame(data=rand(10000000,1))
df.to_csv('test.csv')
df.to_hdf('test.h5', 'df')

ls -sh test*
260M test.csv  153M test.h5

Expressing numbers as floats should take fewer bytes than expressing them as strings of characters with one character per digit. This is generally true, except in my first example, in which all the numbers were '0.0'. Thus, not many characters were needed to represent each number, and so the string representation was smaller than the float representation.
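This can be checked directly. A minimal sketch comparing the byte cost of a value written as CSV text versus stored as a binary double (the random value here is just an arbitrary example):

```python
import struct

zero_text = str(0.0)                       # what to_csv writes for 0.0
rand_text = str(0.7231742029971469)        # a typical random float's text form
binary = struct.pack('<d', 0.7231742029971469)  # 8-byte IEEE 754 double

print(len(zero_text))   # 3 bytes as text
print(len(rand_text))   # 18 bytes as text
print(len(binary))      # always 8 bytes in binary
```

So a column of zeros is cheaper as text than as binary, while a column of random doubles is much more expensive as text.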

asked Mar 09 '15 by jeffalstott

People also ask

Is HDF5 better than CSV?

HDF5 stores data in a binary format that is native to a computing platform but portable across platforms. This binary format is more efficient for computers than text formats (e.g., .txt or .csv), which are meant for humans to read.

Is HDF5 faster than CSV?

Averaged I/O times differ across data formats. An interesting observation is that HDF can show even slower loading speed than CSV, while other binary formats perform noticeably better. The two most impressive are feather and parquet.

Why are HDF5 files so large?

This is probably due to your chunk layout - the more chunk sizes are small the more your HDF5 file will be bloated. Try to find an optimal balance between chunk sizes (to solve your use-case properly) and the overhead (size-wise) that they introduce in the HDF5 file.

How does pandas work with large CSV files?

Using pandas, one way to process large files is to read the entries in chunks of reasonable size, which are read into memory and processed before reading the next chunk. We can use the chunksize parameter to specify the size of each chunk, i.e. the number of rows.
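The chunked-reading approach described above can be sketched as follows; the in-memory CSV and the column name 'x' are just illustrative:

```python
import io

import numpy as np
import pandas as pd

# Build a small CSV in memory to stand in for a large file on disk.
csv_data = pd.DataFrame(np.arange(10), columns=['x']).to_csv(index=False)

# chunksize=3 yields DataFrames of at most 3 rows at a time, so only
# one chunk needs to be in memory at once.
total = 0
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=3):
    total += chunk['x'].sum()

print(total)  # 45, the sum of 0..9
```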


1 Answer

Briefly:

  • csv files are 'dumb': they store one character at a time, so if you print the (eight-byte) float 1.0 to ten digits you really use that many bytes -- but the good news is that csv compresses well, so consider .csv.gz.

  • hdf5 is a meta-format and the No Free Lunch theorem still holds: the entries and values need to be stored somewhere. Which may make hdf5 larger.
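The .csv.gz suggestion above works directly from pandas, which picks the gzip codec from the filename suffix (the sizes and filenames here are illustrative; the zeros example from the question compresses especially well):

```python
import os

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((100000, 1)))
df.to_csv('test.csv')
df.to_csv('test.csv.gz', compression='gzip')

# The repetitive text compresses to a small fraction of the plain CSV.
print(os.path.getsize('test.csv.gz') < os.path.getsize('test.csv'))  # True
```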

But you are overlooking a larger issue: csv is just text, which has limited precision -- whereas hdf5 is one of several binary (serialization) formats that store data at full precision. It really is apples to oranges in that regard too.
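A minimal sketch of the precision point: writing a float as text with a fixed number of digits (as a CSV float_format might) loses information on the round trip, while an 8-byte binary double is exact.

```python
import struct

x = 0.1 + 0.2                    # 0.30000000000000004 in IEEE 754 doubles

# Six decimal digits of text is not enough to recover x exactly.
text_6 = f"{x:.6f}"              # "0.300000"
assert float(text_6) != x

# The 8-byte binary representation round-trips exactly.
packed = struct.pack('<d', x)
assert struct.unpack('<d', packed)[0] == x
```

(Note that pandas' default to_csv writes enough digits to round-trip doubles; the loss only bites once you limit digits via float_format or similar.)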

answered Oct 04 '22 by Dirk Eddelbuettel