
Pandas groupby+transform on 50 million rows is taking 3 hours

I am using the pandas module. My DataFrame has three fields: account, month and salary.

    account month              salary
    1       201501             10000
    2       201506             20000
    2       201506             20000
    3       201508             30000
    3       201508             30000
    3       201506             10000
    3       201506             10000
    3       201506             10000
    3       201506             10000

I group by account and month, and convert each salary to its share of the total salary of the group it belongs to.

MyDataFrame['salary'] = MyDataFrame.groupby(['account', 'month'])['salary'].transform(lambda x: x / x.sum())

Now MyDataFrame looks like the table below:

    account month              salary
    1       201501             1.00
    2       201506             0.50
    2       201506             0.50
    3       201508             0.50
    3       201508             0.50
    3       201506             0.25
    3       201506             0.25
    3       201506             0.25
    3       201506             0.25

The problem: this operation on 50 million such rows takes 3 hours. The groupby alone, run separately, is fast and takes only about 5 seconds, so I think it is the transform that is taking the time. Is there any way to improve performance?

Update: To provide more clarity, an example: an account holder received a salary of 2000 in June and 8000 in July, so his proportion becomes .2 for June and .8 for July. My purpose is to calculate this proportion.
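
For example, here is a minimal sketch of that proportion calculation on a hypothetical two-row frame (grouping on account only, as the June/July example implies; column names follow the question):

    import pandas as pd

    # hypothetical toy data mirroring the update: one account paid in two months
    toy = pd.DataFrame({'account': [1, 1],
                        'month': [201506, 201507],
                        'salary': [2000, 8000]})

    # each row's salary as a proportion of that account's total salary
    toy['salary'] = toy.groupby('account')['salary'].transform(lambda x: x / x.sum())
    print(toy)  # the salary column becomes 0.2 and 0.8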

asked Aug 08 '15 by Vipin

2 Answers

Well, you need to be more explicit and show exactly what you are doing. This is something pandas excels at.

Note for @Uri Goren: this is a constant-memory process that holds only one group in memory at a time, so it scales linearly with the number of groups. Sorting is also unnecessary.

In [19]: import numpy as np; import pandas as pd; from pandas import DataFrame

In [20]: np.random.seed(1234)

In [21]: ngroups = 1000

In [22]: nrows = 50000000

In [23]: dates = pd.date_range('20000101',freq='MS',periods=ngroups)

In [24]:  df = DataFrame({'account' : np.random.randint(0,ngroups,size=nrows),
                 'date' : dates.take(np.random.randint(0,ngroups,size=nrows)),
                 'values' : np.random.randn(nrows) })


In [25]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50000000 entries, 0 to 49999999
Data columns (total 3 columns):
account    int64
date       datetime64[ns]
values     float64
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 1.5 GB

In [26]: df.head()
Out[26]: 
   account       date    values
0      815 2048-02-01 -0.412587
1      723 2023-01-01 -0.098131
2      294 2020-11-01 -2.899752
3       53 2058-02-01 -0.469925
4      204 2080-11-01  1.389950

In [27]: %timeit df.groupby(['account','date']).sum()
1 loops, best of 3: 8.08 s per loop

If you want to transform the output, then do it like this:

In [37]: g = df.groupby(['account','date'])['values']

In [38]: result = 100*df['values']/g.transform('sum')

In [41]: result.head()
Out[41]: 
0     4.688957
1    -2.340621
2   -80.042089
3   -13.813078
4   -70.857014
dtype: float64

In [43]: len(result)
Out[43]: 50000000

In [42]: %timeit 100*df['values']/g.transform('sum')
1 loops, best of 3: 30.9 s per loop

This takes a bit longer, but again it should be a relatively fast operation.
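
Translated back to the question's column names, a minimal sketch of the same idea, using the named 'sum' aggregation instead of a Python lambda (the frame below just rebuilds the question's example data so the snippet stands on its own):

    import pandas as pd

    # the question's example data, rebuilt here for a self-contained run
    MyDataFrame = pd.DataFrame({
        'account': [1, 2, 2, 3, 3, 3, 3, 3, 3],
        'month':   [201501, 201506, 201506, 201508, 201508,
                    201506, 201506, 201506, 201506],
        'salary':  [10000, 20000, 20000, 30000, 30000,
                    10000, 10000, 10000, 10000],
    })

    # the key change: transform('sum') goes through pandas' fast built-in
    # aggregation path, while transform(lambda x: x / x.sum()) calls back
    # into Python once per group
    g = MyDataFrame.groupby(['account', 'month'])['salary']
    MyDataFrame['salary'] = MyDataFrame['salary'] / g.transform('sum')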

answered Oct 11 '22 by Jeff


I would use a different approach. First, sort:

MyDataFrame.sort_values(['account', 'month'], inplace=True)

Then iterate and sum

import pandas as pd

(account, month) = ('', '')  # sentinel values that match no real group
salary = 0.0
res = []
for index, row in MyDataFrame.iterrows():
    if (row['account'], row['month']) == (account, month):
        salary += row['salary']
    else:
        if account != '':  # skip the initial sentinel group
            res.append([account, month, salary])
        salary = row['salary']  # start the new group with this row's salary
        (account, month) = (row['account'], row['month'])
if account != '':
    res.append([account, month, salary])  # append the final group after the loop
df = pd.DataFrame(res, columns=['account', 'month', 'salary'])

This way, pandas doesn't need to hold the grouped data in memory.
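
To get from these per-group sums back to the per-row proportions the question asks for, one option is to merge them onto the original frame; a sketch, assuming MyDataFrame and the summed df built above:

    # merge the per-group totals back onto the original rows, then divide
    merged = MyDataFrame.merge(df, on=['account', 'month'],
                               suffixes=('', '_group_total'))
    merged['salary'] = merged['salary'] / merged['salary_group_total']
    result = merged.drop(columns='salary_group_total')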

answered Oct 11 '22 by Uri Goren