I have a CSV dataset of size 5.2 GB (taken from here). It has about 7M rows, each with 29 columns, and the values are of type float64. I want to convert this dataset into a binary file. To do so, I use the following simple lines:
import numpy as np
import pandas as pd

# load the CSV into a DataFrame (all values are parsed as float64)
df = pd.read_csv('data.csv', sep=',')
# dump the underlying float64 array to a raw binary file
np.asarray(df.values).tofile('data_binary.dat')
A snapshot of the data looks like this:
0.000000000000000000e+00,9.439358860254287720e-02,1.275558676570653915e-02,9.119330644607543945e-01,-9.083136916160583496e-02,-2.335745543241500854e-01,-1.054220795631408691e+00,-9.759366512298583984e-01,-1.067278265953063965e+00,-6.138502955436706543e-01,7.542607188224792480e-01,-9.256605505943298340e-01,-5.289512276649475098e-01,1.235263347625732422e+00,8.606486320495605469e-01,-2.320102453231811523e-01,-4.043335020542144775e-01,-1.559396624565124512e+00,-8.154401183128356934e-01,-1.376865267753601074e+00,6.759096682071685791e-02,1.372575879096984863e+00,-5.736824870109558105e-01,-1.368692040443420410e+00,-4.793794453144073486e-01,1.529256343841552734e+00,-5.757816433906555176e-01,-1.290232419967651367e+00,4.999999694824218750e+02
1.000000000000000000e+00,3.272003531455993652e-01,-2.395536154508590698e-01,-1.592038273811340332e+00,-2.324983835220336914e+00,-5.070934891700744629e-01,1.574625492095947266e+00,-1.050106048583984375e+00,9.686639308929443359e-01,1.312386870384216309e+00,7.542607188224792480e-01,-9.113077521324157715e-01,-1.718587398529052734e+00,3.751282095909118652e-01,8.606486320495605469e-01,-3.711451292037963867e-01,-5.625200271606445312e-01,-2.721544206142425537e-01,-8.154401183128356934e-01,-3.339428007602691650e-01,1.058411240577697754e+00,4.364815354347229004e-01,-5.736824870109558105e-01,-2.172690257430076599e-02,-5.791836977005004883e-01,-3.260441124439239502e-01,-2.024624943733215332e-01,-4.585579931735992432e-01,7.500000000000000000e+02
The new binary file data_binary.dat is reduced to 1.5 GB. This is a huge reduction, which made me wonder whether something went wrong with the way I convert the CSV to binary format. Is this reduction expected? At least this much?
A binary file is usually very much smaller than a text file that contains an equivalent amount of data. For image, video, and audio data this is important. Small files save storage space, can be transmitted faster, and are processed faster. I/O with smaller files is faster, too, since there are fewer bytes to move.
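To put numbers on that for this case, here is a quick illustration using one value from the snapshot above (struct is only used here to show the binary size of a single double):

import struct

s = "9.439358860254287720e-02"
print(len(s))                           # 24 characters -> 24 bytes as text (25-26 with the comma/newline)
print(len(struct.pack("d", float(s))))  # 8 bytes as a binary float64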
Ok, so I went and downloaded a sample of the data. Each row is something like:
0.000000000000000000e+00,9.439358860254287720e-02,1.275558676570653915e-02 ...
Each individual number has about 25 characters, or 26 or so if you include the comma. At one byte per character, that is roughly 25-26 bytes per value as text. A binary representation of a 64-bit floating-point number requires ... 64 bits, i.e. 8 bytes per number. So you should expect the binary file to be a bit less than 1/3 the size, which fits:
5.2/3 = 1.73...
A better estimate would be about 26 characters per number (including commas and line-breaks), so:
In [2]: (8/26)*5.2
Out[2]: 1.6
Seems legit.
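You can also cross-check against the stated shape of the dataset (a rough estimate, assuming roughly 7 million rows of 29 float64 columns):

rows, cols, bytes_per_value = 7_000_000, 29, 8
size_bytes = rows * cols * bytes_per_value
print(size_bytes / 1e9)    # ~1.62 GB (decimal gigabytes)
print(size_bytes / 2**30)  # ~1.51 GiB, matching the observed 1.5 GB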