
The conversion from csv to binary format reduces the file size abnormally

I have a CSV dataset of size 5.2GB (taken from here). It has about 7M rows with 29 columns each. The values are of type float64. I want to convert this dataset into a binary file. To do so, I use the following simple lines:

import numpy as np
import pandas as pd

df = pd.read_csv('data.csv', sep=',')
np.asarray(df.values).tofile('data_binary.dat')

A snapshot of the data looks like this:

0.000000000000000000e+00,9.439358860254287720e-02,1.275558676570653915e-02,9.119330644607543945e-01,-9.083136916160583496e-02,-2.335745543241500854e-01,-1.054220795631408691e+00,-9.759366512298583984e-01,-1.067278265953063965e+00,-6.138502955436706543e-01,7.542607188224792480e-01,-9.256605505943298340e-01,-5.289512276649475098e-01,1.235263347625732422e+00,8.606486320495605469e-01,-2.320102453231811523e-01,-4.043335020542144775e-01,-1.559396624565124512e+00,-8.154401183128356934e-01,-1.376865267753601074e+00,6.759096682071685791e-02,1.372575879096984863e+00,-5.736824870109558105e-01,-1.368692040443420410e+00,-4.793794453144073486e-01,1.529256343841552734e+00,-5.757816433906555176e-01,-1.290232419967651367e+00,4.999999694824218750e+02
1.000000000000000000e+00,3.272003531455993652e-01,-2.395536154508590698e-01,-1.592038273811340332e+00,-2.324983835220336914e+00,-5.070934891700744629e-01,1.574625492095947266e+00,-1.050106048583984375e+00,9.686639308929443359e-01,1.312386870384216309e+00,7.542607188224792480e-01,-9.113077521324157715e-01,-1.718587398529052734e+00,3.751282095909118652e-01,8.606486320495605469e-01,-3.711451292037963867e-01,-5.625200271606445312e-01,-2.721544206142425537e-01,-8.154401183128356934e-01,-3.339428007602691650e-01,1.058411240577697754e+00,4.364815354347229004e-01,-5.736824870109558105e-01,-2.172690257430076599e-02,-5.791836977005004883e-01,-3.260441124439239502e-01,-2.024624943733215332e-01,-4.585579931735992432e-01,7.500000000000000000e+02

The new binary file data_binary.dat is only 1.5GB. This is a huge reduction, which made me wonder whether something went wrong with the way I convert CSV to binary format. Is this reduction expected? At least this much?

Kristy asked May 12 '18 18:05



1 Answer

Ok, so I went and downloaded a sample of the data. Each row is something like:

0.000000000000000000e+00,9.439358860254287720e-02,1.275558676570653915e-02 ...

Each individual number takes about 25 characters, or 26 or so if you include the comma. At one byte per character, that's about 25 bytes per number. A binary representation of a 64-bit floating-point number requires ... 64 bits, i.e. 8 bytes per number. So you should expect the binary file to be less than 1/3 the size, so this seems correct:

5.2/3 = 1.73...

A better estimate would be about 26 characters per number (including commas and line-breaks), so:

In [2]: (8/26)*5.2
Out[2]: 1.6
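
You can also cross-check from the other direction, without looking at the CSV at all: rows times columns times 8 bytes. This is just a sketch, and the row count is an assumption here (the question only says "about 7M"), so treat the result as approximate:

```python
import numpy as np

# Assumption: exactly 7,000,000 rows; the question only says "about 7M".
rows = 7_000_000
cols = 29

# A float64 occupies 8 bytes in the raw binary file.
bytes_per_value = np.dtype(np.float64).itemsize

expected_bytes = rows * cols * bytes_per_value
expected_gb = expected_bytes / 1e9

print(f"{expected_gb:.3f} GB")
```

That lands in the same range as the 1.5GB file and the estimate above.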

Seems legit.
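
If you want to convince yourself that nothing was lost, here is a minimal sketch on a small synthetic array (not your real data; the 1000 x 29 shape and the file names are made up for illustration). It writes the same values as text with 18 decimal digits, like your snapshot, and as raw binary, then compares sizes and reads the binary back:

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in for the real dataset (assumption: 1000 rows x 29 float64 columns).
rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 29))

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "data.csv")
    bin_path = os.path.join(tmp, "data_binary.dat")

    # Same text layout as the snapshot: scientific notation with 18 decimal digits.
    np.savetxt(csv_path, data, delimiter=",", fmt="%.18e")
    # Raw binary dump: 8 bytes per float64, no header, no shape information.
    data.tofile(bin_path)

    csv_size = os.path.getsize(csv_path)
    bin_size = os.path.getsize(bin_path)

    # tofile stores only the values, so dtype and column count must be supplied on read.
    restored = np.fromfile(bin_path, dtype=np.float64).reshape(-1, 29)

ratio = bin_size / csv_size
print(f"csv: {csv_size} bytes, binary: {bin_size} bytes, ratio: {ratio:.3f}")
print("round trip exact:", np.array_equal(restored, data))
```

The thing to keep in mind is that the .dat file carries no dtype or shape, so you have to remember both (float64, 29 columns) when loading it back with np.fromfile.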

juanpa.arrivillaga answered Oct 05 '22 19:10