For my university assignment, I have to produce a CSV file with the distances between all the airports in the world... the problem is that my CSV file weighs 151 MB. I want to reduce it as much as I can. This is my CSV:

and this is my code:
import pandas as pd
import numpy as np

# drop all features we don't need
for attribute in df:
    if attribute not in ('NAME', 'COUNTRY', 'IATA', 'LAT', 'LNG'):
        df = df.drop(attribute, axis=1)

# create a dictionary of airports, each airport has the following structure:
# IATA : (NAME, COUNTRY, LAT, LNG)
airport_dict = {}
for airport in df.itertuples():
    airport_dict[airport[3]] = (airport[1], airport[2], airport[4], airport[5])

# From tutorial 4 solution:
airportcodes = list(airport_dict)
airportdists = pd.DataFrame()
for i, airport_code1 in enumerate(airportcodes):
    airport1 = airport_dict[airport_code1]
    dists = []
    for j, airport_code2 in enumerate(airportcodes):
        if j > i:
            airport2 = airport_dict[airport_code2]
            dists.append(distanceBetweenAirports(airport1[2], airport1[3], airport2[2], airport2[3]))
        else:
            # little edit: no need to calculate the distance twice, all duplicates are set to 0 distance
            dists.append(0)
    airportdists[i] = dists
airportdists.columns = airportcodes
airportdists.index = airportcodes

# set all 0 distance values to NaN
airportdists = airportdists.replace(0, np.nan)
airportdists.to_csv(r'../Project Data Files-20190322/distances.csv')
I also tried re-indexing it before saving:
# remove all NaN values
airportdists = airportdists.stack().reset_index()
airportdists.columns = ['airport1','airport2','distance']
but the result is a dataframe with 3 columns and 17 million rows, and a disk size of 419 MB... not exactly an improvement...
Can you help me shrink the size of my csv? Thank you!
I have built a similar application in the past; here's what I would do:
It is difficult to shrink your file, but if your application only ever needs the distances from one airport to all the others, I suggest creating 9541 files: each file holds the distances from one airport to every other airport, and its name is that airport's code. A rough sketch of this is shown below.
That way, loading the file for a single airport is really fast.
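A minimal sketch of that idea, assuming the airportdists matrix from the question is fully populated and indexed by IATA code (the output folder name and the 'LHR' code in the usage comment are just placeholders):

import pandas as pd
from pathlib import Path

# one small CSV per airport: each file holds that airport's distances
# to every other airport (assumes airportdists from the question exists)
out_dir = Path('distances_per_airport')  # placeholder folder name
out_dir.mkdir(exist_ok=True)

for code in airportdists.columns:
    # one column -> one file named after the airport's IATA code
    airportdists[code].dropna().to_csv(out_dir / (code + '.csv'), header=['distance'])

# later, loading the distances for a single airport is quick, e.g.:
# pd.read_csv(out_dir / 'LHR.csv', index_col=0)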
My suggestion would be, instead of storing the data as a CSV, to store it in a key/value structure such as JSON, which is very fast to retrieve from. Alternatively, try the Parquet file format, which consumes roughly a quarter of the CSV file's storage.
import pandas as pd
import numpy as np
from pathlib import Path
from string import ascii_letters

# create a sample dataframe with 52 columns of random integers
df = pd.DataFrame(np.random.randint(0, 10000, size=(1000000, 52)), columns=list(ascii_letters))

df.to_csv('csv_store.csv', index=False)
print('CSV consumed {} MB'.format(Path('csv_store.csv').stat().st_size * 0.000001))
# CSV consumed 255.22423999999998 MB

df.to_parquet('parquet_store', index=False)
print('Parquet consumed {} MB'.format(Path('parquet_store').stat().st_size * 0.000001))
# Parquet consumed 93.221154 MB
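For the key/value idea, here is a rough sketch that reuses the airport_dict and distanceBetweenAirports from the question (the output file name is just a placeholder):

import json

# nested dict: {origin IATA: {destination IATA: distance}}
# (upper triangle only, mirroring the loop in the question)
distances = {}
codes = list(airport_dict)
for i, code1 in enumerate(codes):
    lat1, lng1 = airport_dict[code1][2], airport_dict[code1][3]
    distances[code1] = {}
    for code2 in codes[i + 1:]:
        lat2, lng2 = airport_dict[code2][2], airport_dict[code2][3]
        distances[code1][code2] = distanceBetweenAirports(lat1, lng1, lat2, lng2)

# write once; after json.load(), a lookup is a plain dict access
with open('distances.json', 'w') as f:
    json.dump(distances, f)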