I am using the pandas module. My DataFrame has three fields: account, month, and salary.
account  month   salary
1        201501  10000
2        201506  20000
2        201506  20000
3        201508  30000
3        201508  30000
3        201506  10000
3        201506  10000
3        201506  10000
3        201506  10000
I am doing a groupby on account and month and converting each salary to its share of the total salary of the group it belongs to.
MyDataFrame['salary'] = MyDataFrame.groupby(['account', 'month'])['salary'].transform(lambda x: x / x.sum())
Now MyDataFrame becomes the table below:
account  month   salary
1        201501  1
2        201506  .5
2        201506  .5
3        201508  .5
3        201508  .5
3        201506  .25
3        201506  .25
3        201506  .25
3        201506  .25
The problem is that this operation on 50 million such rows takes 3 hours. When I run the groupby on its own it is fast, taking only about 5 seconds, so I think it is the transform that takes the time here. Is there any way to improve performance?
Update: to provide more clarity, here is an example. An account holder received a salary of 2000 in June and 8000 in July, so his proportion becomes .2 for June and .8 for July. My purpose is to calculate this proportion.
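For reference, here is a minimal sketch of that calculation on a toy frame (the toy data and the proportion column name are illustrative only, not part of the original question):

import pandas as pd

# Toy frame matching the update: one account, 2000 in June and 8000 in July.
toy = pd.DataFrame({'account': [1, 1],
                    'month': [201506, 201507],
                    'salary': [2000.0, 8000.0]})

# Each row's salary as a share of that account's total: .2 for June, .8 for July.
toy['proportion'] = toy['salary'] / toy.groupby('account')['salary'].transform('sum')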
Well, you need to be more explicit and show exactly what you are doing. This is something pandas excels at.
Note for @Uri Goren: this is a constant-memory process that only holds one group in memory at a time. It will scale linearly with the number of groups. Sorting is also unnecessary.
In [20]: np.random.seed(1234)
In [21]: ngroups = 1000
In [22]: nrows = 50000000
In [23]: dates = pd.date_range('20000101',freq='MS',periods=ngroups)
In [24]: df = DataFrame({'account' : np.random.randint(0,ngroups,size=nrows),
                         'date' : dates.take(np.random.randint(0,ngroups,size=nrows)),
                         'values' : np.random.randn(nrows) })
In [25]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50000000 entries, 0 to 49999999
Data columns (total 3 columns):
account int64
date datetime64[ns]
values float64
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 1.5 GB
In [26]: df.head()
Out[26]:
   account       date    values
0      815 2048-02-01 -0.412587
1      723 2023-01-01 -0.098131
2      294 2020-11-01 -2.899752
3       53 2058-02-01 -0.469925
4      204 2080-11-01  1.389950
In [27]: %timeit df.groupby(['account','date']).sum()
1 loops, best of 3: 8.08 s per loop
If you want to transform the output, then do it like this:
In [37]: g = df.groupby(['account','date'])['values']
In [38]: result = 100*df['values']/g.transform('sum')
In [41]: result.head()
Out[41]:
0 4.688957
1 -2.340621
2 -80.042089
3 -13.813078
4 -70.857014
dtype: float64
In [43]: len(result)
Out[43]: 50000000
In [42]: %timeit 100*df['values']/g.transform('sum')
1 loops, best of 3: 30.9 s per loop
Takes a bit longer. But again, this should be a relatively fast operation.
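Applied to the question's column names, the same transform pattern would look roughly like this (a sketch assuming MyDataFrame has account, month, and salary columns; not benchmarked here):

# Divide each salary by its (account, month) group total in one vectorized pass.
g = MyDataFrame.groupby(['account', 'month'])['salary']
MyDataFrame['salary'] = MyDataFrame['salary'] / g.transform('sum')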
I would use a different approach. First, sort:
MyDataFrame.sort_values(['account', 'month'], inplace=True)
Then iterate and sum:
(account, month) = ('', '')                    # sentinel values that match no real group
salary = 0.0
res = []
for index, row in MyDataFrame.iterrows():
    if (row['account'], row['month']) == (account, month):
        salary += row['salary']
    else:
        if account != '':                      # flush the previous group's total
            res.append([account, month, salary])
        (account, month) = (row['account'], row['month'])
        salary = row['salary']                 # start the new group's running sum
res.append([account, month, salary])           # flush the last group
df = pd.DataFrame(res, columns=['account', 'month', 'salary'])
This way, pandas doesn't need to hold the grouped data in memory.
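Note that this loop produces one row of totals per (account, month) group. To recover the per-row proportions the question asks for, those totals would still have to be merged back, roughly like this (an illustrative sketch, not part of the original answer):

# Merge the per-group totals back and divide to get each row's share of its group.
totals = df.rename(columns={'salary': 'group_total'})
merged = MyDataFrame.merge(totals, on=['account', 'month'])
merged['salary'] = merged['salary'] / merged['group_total']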