import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.zeros((1000000,1)))
df.to_csv('test.csv')
df.to_hdf('test.h5', 'df')
ls -sh test*
11M test.csv 16M test.h5
If I use an even larger dataset, the effect is even bigger. Using an HDFStore as below changes nothing.
store = pd.HDFStore('test.h5')
store.put('df', df, format='table')  # store the DataFrame itself, not the raw ndarray
store.close()
Edit: Never mind. The example is bad! Using some non-trivial numbers instead of zeros changes the story.
from numpy.random import rand
import pandas as pd
df = pd.DataFrame(data=rand(10000000,1))
df.to_csv('test.csv')
df.to_hdf('test.h5', 'df')
ls -sh test*
260M test.csv 153M test.h5
Expressing numbers as floats should take fewer bytes than expressing them as strings of characters, one character per digit. This is generally true, except in my first example, in which every value was 0.0. The string '0.0' needs only three characters, while a float64 always needs eight bytes, so in that case the text representation was actually smaller than the binary one.
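You can check the byte counts yourself. A minimal sketch (the long literal is just an arbitrary non-trivial float, not taken from the data above):

import numpy as np

print(np.float64(0.0).nbytes)          # 8 bytes for any float64 stored in binary
print(len('0.0'))                      # 3 bytes for the same value written as text
print(len(repr(0.5488135039273248)))   # 18 bytes for a non-trivial float as text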
HDF5 stores data in a binary format that is native to the machine yet portable across platforms. This binary layout makes it more efficient for computers to read and write than text formats (e.g., .txt or .csv), which are meant for humans to read.
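To make the binary-versus-text contrast concrete, here is a small illustration of my own using Python's struct module (not from the quoted answer):

import struct

x = 1 / 3
binary = struct.pack('<d', x)   # machine-native IEEE 754 double: always 8 bytes
text = repr(x).encode()         # human-readable decimal digits: 18 bytes here
print(len(binary), len(text))   # 8 18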
Averaged I/O times for each data format (summarizing a benchmark plot) reveal an interesting observation: hdf shows an even slower loading speed than csv, while the other binary formats perform noticeably better. The two most impressive are feather and parquet.
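For reference, a rough sketch to reproduce such a benchmark yourself (it assumes pytables is installed for hdf and pyarrow for feather/parquet; absolute times will vary by machine):

import time
import pandas as pd
from numpy.random import rand

df = pd.DataFrame(data=rand(10000000, 1), columns=['x'])
df.to_csv('test.csv')
df.to_hdf('test.h5', 'df')
df.to_feather('test.feather')
df.to_parquet('test.parquet')

def timed(reader, path):
    t0 = time.perf_counter()
    reader(path)
    return time.perf_counter() - t0

print('csv    ', timed(pd.read_csv, 'test.csv'))
print('hdf    ', timed(pd.read_hdf, 'test.h5'))
print('feather', timed(pd.read_feather, 'test.feather'))
print('parquet', timed(pd.read_parquet, 'test.parquet'))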
This is probably due to your chunk layout: the smaller the chunks, the more your HDF5 file will be bloated. Try to find a balance between chunk sizes small enough to serve your use case and the size overhead they introduce in the HDF5 file.
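pandas does not expose the HDF5 chunk shape directly, so here is a sketch with h5py (file names and chunk sizes are arbitrary) showing that chunk size alone changes the file size:

import h5py
import numpy as np

data = np.random.rand(1000000)

with h5py.File('small_chunks.h5', 'w') as f:
    f.create_dataset('data', data=data, chunks=(16,))      # many tiny chunks: lots of index/metadata overhead

with h5py.File('large_chunks.h5', 'w') as f:
    f.create_dataset('data', data=data, chunks=(65536,))   # few large chunks: little overhead

# compare sizes with: ls -sh small_chunks.h5 large_chunks.h5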
Using pandas, one way to process large files is to read the entries in chunks of reasonable size: each chunk is read into memory and processed before the next chunk is read. We can use the chunksize parameter to specify the size of a chunk, i.e., the number of lines.
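For example, summing a column of the test.csv written above without ever loading the whole file into memory (the column name '0' comes from the DataFrame's default integer column label):

import pandas as pd

total = 0.0
for chunk in pd.read_csv('test.csv', chunksize=100000):   # 100,000 rows at a time
    total += chunk['0'].sum()                             # process, then discard the chunk
print(total)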
Briefly:
csv files are 'dumb': they are written one character at a time, so if you print a (say, four-byte) float to ten digits you really use ten bytes -- but the good news is that csv compresses well, so consider .csv.gz.
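With pandas this is a one-liner, since compression is inferred from the file extension (a sketch; the exact savings depend on the data):

import pandas as pd
from numpy.random import rand

df = pd.DataFrame(data=rand(10000000, 1))
df.to_csv('test.csv')      # plain text
df.to_csv('test.csv.gz')   # gzip-compressed, inferred from the .gz suffix
# compare sizes with: ls -sh test.csv test.csv.gz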
hdf5 is a meta-format, and the No Free Lunch theorem still holds: the entries and values need to be stored somewhere, which may make hdf5 larger.
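That overhead can often be offset by HDF5's built-in compression, which pandas exposes through to_hdf (the complevel/complib values here are just one reasonable choice):

import pandas as pd
from numpy.random import rand

df = pd.DataFrame(data=rand(10000000, 1))
df.to_hdf('test.h5', 'df')                                       # uncompressed
df.to_hdf('test_blosc.h5', 'df', complevel=9, complib='blosc')   # compressed
# compare sizes with: ls -sh test.h5 test_blosc.h5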
But you are overlooking a larger issue: csv is just text, which has limited precision -- whereas hdf5 is one of several binary (serialization) formats that store data at higher precision. It really is apples to oranges in that regard too.
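A round-trip sketch of the precision point (the %.6f format is an arbitrary choice someone might use to shrink a csv; by default pandas writes full precision):

import pandas as pd

df = pd.DataFrame({'x': [1 / 3]})
df.to_csv('prec.csv', float_format='%.6f')       # text truncated to 6 digits
print(pd.read_csv('prec.csv')['x'][0] - 1 / 3)   # nonzero: precision was lost

df.to_hdf('prec.h5', 'df')                       # binary round-trips exactly
print(pd.read_hdf('prec.h5')['x'][0] - 1 / 3)    # 0.0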