
Summary statistics on a large CSV file using Python pandas

Let's say I have a 10 GB CSV file and I want to get the summary statistics of the file using the DataFrame describe method.

In this case, first I need to create a DataFrame for all 10 GB of CSV data:

import pandas as pd

text_csv = pd.read_csv("target.csv")
df = pd.DataFrame(text_csv)
df.describe()

Does this mean all 10 GB will be loaded into memory to calculate the statistics?

asked Feb 23 '16 by Subbi reddy dwarampudi


2 Answers

Yes, I think you are right. You can also omit df = pd.DataFrame(text_csv), because the output of read_csv is already a DataFrame:

import pandas as pd

df = pd.read_csv("target.csv")
print(df.describe())

Or you can use dask:

import dask.dataframe as dd

# dask reads the CSV lazily, so the whole file never has to fit in memory;
# describe() only builds a task graph and compute() evaluates it by partition
df = dd.read_csv('target.csv')

print(df.describe().compute())

You can use the chunksize parameter of read_csv, but then the output is a TextFileReader, not a DataFrame, so you need concat:

import pandas as pd
import io

temp=u"""a;b
1;525
1;526
1;533
2;527
2;528
2;532
3;519
3;534
3;535
4;530
5;529
5;531
6;520
6;521
6;524"""
# after testing, replace io.StringIO(temp) with the filename
# chunksize=2 only for testing
tp = pd.read_csv(io.StringIO(temp), sep=";", chunksize=2)
print(tp)
<pandas.io.parsers.TextFileReader object at 0x000000001995ADA0>
df = pd.concat(tp, ignore_index=True)
print(df.describe())
               a           b
count  15.000000   15.000000
mean    3.333333  527.600000
std     1.877181    5.082182
min     1.000000  519.000000
25%     2.000000  524.500000
50%     3.000000  528.000000
75%     5.000000  531.500000
max     6.000000  535.000000

You can also convert the TextFileReader output to DataFrames chunk by chunk, but aggregating this output can be difficult:

import pandas as pd

import io
temp=u"""a;b
1;525
1;526
1;533
2;527
2;528
2;532
3;519
3;534
3;535
4;530
5;529
5;531
6;520
6;521
6;524"""

# after testing, replace io.StringIO(temp) with the filename
tp = pd.read_csv(io.StringIO(temp), sep=";", chunksize=2)
print(tp)

dfs = []
for t in tp:
    # each chunk t is already a DataFrame, so pd.DataFrame(t) just copies it
    df = pd.DataFrame(t)
    df1 = df.describe()
    dfs.append(df1.T)

df2 = pd.concat(dfs)
print(df2)
   count   mean        std  min     25%    50%     75%  max
a      2    1.0   0.000000    1    1.00    1.0    1.00    1
b      2  525.5   0.707107  525  525.25  525.5  525.75  526
a      2    1.5   0.707107    1    1.25    1.5    1.75    2
b      2  530.0   4.242641  527  528.50  530.0  531.50  533
a      2    2.0   0.000000    2    2.00    2.0    2.00    2
b      2  530.0   2.828427  528  529.00  530.0  531.00  532
a      2    3.0   0.000000    3    3.00    3.0    3.00    3
b      2  526.5  10.606602  519  522.75  526.5  530.25  534
a      2    3.5   0.707107    3    3.25    3.5    3.75    4
b      2  532.5   3.535534  530  531.25  532.5  533.75  535
a      2    5.0   0.000000    5    5.00    5.0    5.00    5
b      2  530.0   1.414214  529  529.50  530.0  530.50  531
a      2    6.0   0.000000    6    6.00    6.0    6.00    6
b      2  520.5   0.707107  520  520.25  520.5  520.75  521
a      1    6.0        NaN    6    6.00    6.0    6.00    6
b      1  524.0        NaN  524  524.00  524.0  524.00  524
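If you do want to fold these per-chunk rows back into one global summary, a count-weighted combination is enough for count, mean, min and max (std and the percentiles cannot be recovered from the per-chunk output alone). A minimal sketch, continuing from df2 above, with combined as an illustrative name:

g = df2.groupby(level=0)
combined = pd.DataFrame({
    "count": g["count"].sum(),
    # exact global mean: weight each chunk mean by its chunk count
    "mean": g.apply(lambda x: (x["count"] * x["mean"]).sum() / x["count"].sum()),
    "min": g["min"].min(),
    "max": g["max"].max(),
})
print(combined)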
answered by jezrael


It seems there is no file size limitation for the pandas.read_csv method.

According to @fickludd's and @Sebastian Raschka's answers in Large, persistent DataFrame in pandas, you can use iterator=True and chunksize=xxx to load the giant CSV file in chunks and calculate the statistics you want:

import pandas as pd

# gives a TextFileReader, which is iterable in chunks of 1000 rows
reader = pd.read_csv('some_data.csv', iterator=True, chunksize=1000)

# describe() works on each chunk, not on the reader itself
partial_descs = [chunk.describe() for chunk in reader]

Then aggregate all the partial describe information yourself.
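For example, a minimal sketch of that aggregation, continuing from partial_descs above and assuming all columns are numeric (exact for count, mean, min and max; std and the percentiles cannot be merged this way; counts, weighted_mean and summary are illustrative names):

counts = sum(p.loc["count"] for p in partial_descs)
# exact global mean: weight each chunk mean by its chunk count
weighted_mean = sum(p.loc["count"] * p.loc["mean"] for p in partial_descs) / counts

summary = pd.DataFrame({
    "count": counts,
    "mean": weighted_mean,
    "min": pd.concat(p.loc[["min"]] for p in partial_descs).min(),
    "max": pd.concat(p.loc[["max"]] for p in partial_descs).max(),
})
print(summary)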

answered by eric