I have a CSV dataset of size 5.2 GB (taken from here). It has about 7M rows, each with 29 columns, and the values are of type float64. I want to convert this dataset into a binary file. To do so, I use the following simple lines:
import numpy as np
import pandas as pd

# load the CSV into a DataFrame (all values are parsed as float64)
df = pd.read_csv('data.csv', sep=',')
# dump the underlying float64 array to a raw binary file
np.asarray(df.values).tofile('data_binary.dat')
A snapshot of the data looks like this:
0.000000000000000000e+00,9.439358860254287720e-02,1.275558676570653915e-02,9.119330644607543945e-01,-9.083136916160583496e-02,-2.335745543241500854e-01,-1.054220795631408691e+00,-9.759366512298583984e-01,-1.067278265953063965e+00,-6.138502955436706543e-01,7.542607188224792480e-01,-9.256605505943298340e-01,-5.289512276649475098e-01,1.235263347625732422e+00,8.606486320495605469e-01,-2.320102453231811523e-01,-4.043335020542144775e-01,-1.559396624565124512e+00,-8.154401183128356934e-01,-1.376865267753601074e+00,6.759096682071685791e-02,1.372575879096984863e+00,-5.736824870109558105e-01,-1.368692040443420410e+00,-4.793794453144073486e-01,1.529256343841552734e+00,-5.757816433906555176e-01,-1.290232419967651367e+00,4.999999694824218750e+02
1.000000000000000000e+00,3.272003531455993652e-01,-2.395536154508590698e-01,-1.592038273811340332e+00,-2.324983835220336914e+00,-5.070934891700744629e-01,1.574625492095947266e+00,-1.050106048583984375e+00,9.686639308929443359e-01,1.312386870384216309e+00,7.542607188224792480e-01,-9.113077521324157715e-01,-1.718587398529052734e+00,3.751282095909118652e-01,8.606486320495605469e-01,-3.711451292037963867e-01,-5.625200271606445312e-01,-2.721544206142425537e-01,-8.154401183128356934e-01,-3.339428007602691650e-01,1.058411240577697754e+00,4.364815354347229004e-01,-5.736824870109558105e-01,-2.172690257430076599e-02,-5.791836977005004883e-01,-3.260441124439239502e-01,-2.024624943733215332e-01,-4.585579931735992432e-01,7.500000000000000000e+02
The new binary file data_binary.dat is reduced to 1.5 GB. This is a huge reduction, which made me wonder whether something went wrong with the way I convert the CSV to binary format. Is this reduction expected? At least this much?
A binary file is usually very much smaller than a text file that contains an equivalent amount of data. For image, video, and audio data this is important. Small files save storage space, can be transmitted faster, and are processed faster. I/O with smaller files is faster, too, since there are fewer bytes to move.
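To put numbers on that for this case, here is a quick illustration using one value from the snapshot above (struct is only used here to show the binary size of a single double):

import struct

s = "9.439358860254287720e-02"
print(len(s))                           # 24 characters -> 24 bytes as text (25-26 with the comma/newline)
print(len(struct.pack("d", float(s))))  # 8 bytes as a binary float64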
Ok, so I went and downloaded a sample of the data. Each row is something like:
0.000000000000000000e+00,9.439358860254287720e-02,1.275558676570653915e-02 ...
Each individual number has about 25 characters, or 26 or so if you include the comma. At one byte per character, that is roughly 25-26 bytes per value as text. A binary representation of a 64-bit floating-point number requires ... 64 bits, i.e. 8 bytes per number. So you should expect the binary file to be a bit less than 1/3 the size, which fits:
5.2/3 = 1.73...
A better estimate would be about 26 characters per number (including commas and line-breaks), so:
In [2]: (8/26)*5.2
Out[2]: 1.6
Seems legit.
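You can also cross-check against the stated shape of the dataset (a rough estimate, assuming roughly 7 million rows of 29 float64 columns):

rows, cols, bytes_per_value = 7_000_000, 29, 8
size_bytes = rows * cols * bytes_per_value
print(size_bytes / 1e9)    # ~1.62 GB (decimal gigabytes)
print(size_bytes / 2**30)  # ~1.51 GiB, matching the observed 1.5 GB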