Given this data frame:
import pandas as pd
import jenkspy
f = pd.DataFrame({'BreakGroup':['A','A','A','A','A','A','B','B','B','B','B'],
'Final':[1,2,3,4,5,6,10,20,30,40,50]})
BreakGroup Final
0 A 1
1 A 2
2 A 3
3 A 4
4 A 5
5 A 6
6 B 10
7 B 20
8 B 30
9 B 40
10 B 50
I'd like to use jenkspy to compute natural breaks for 4 classes within each "BreakGroup", and then identify the class to which each value in "Final" belongs.
I started out by doing this:
jenks=lambda x: jenkspy.jenks_breaks(f['Final'].tolist(),nb_class=4)
f['Group']=f.groupby(['BreakGroup'])['BreakGroup'].transform(jenks)
...which results in:
BreakGroup
A [1.0, 10.0, 20.0, 30.0, 50.0]
B [1.0, 10.0, 20.0, 30.0, 50.0]
Name: BreakGroup, dtype: object
The first problem here, as you may well have surmised, is that it applies the lambda function to the whole column of "Final" scores instead of just those belonging to each group in the Groupby. The second problem is that I need a column designating the correct group (class) membership, presumably by using transform instead of apply.
I then tried this:
jenks=lambda x: jenkspy.jenks_breaks(f['Final'].loc[f['BreakGroup']==x].tolist(),nb_class=4)
f['Group']=f.groupby(['BreakGroup'])['BreakGroup'].transform(jenks)
...but was promptly beaten back into submission:
ValueError: Can only compare identically-labeled Series objects
Update:
Here is the desired result. The "Result" column contains the upper limit of the group for the respective value from "Final" per group "BreakGroup":
BreakGroup Final Result
0 A 1 2
1 A 2 3
2 A 3 4
3 A 4 4
4 A 5 6
5 A 6 6
6 B 10 20
7 B 20 30
8 B 30 40
9 B 40 50
10 B 50 50
Thanks in advance!
My slightly modified application, based on the accepted solution:
f.sort_values('BreakGroup',inplace=True)
f.reset_index(drop=True,inplace=True)
jenks = lambda x: jenkspy.jenks_breaks(x['Final'].tolist(),nb_class=4)
g = f.set_index('BreakGroup')
g['Groups'] = f.groupby(['BreakGroup']).apply(jenks)
g.reset_index(inplace=True)
groups = lambda x: [gp for gp in x['Groups']]
# 'Final' value should be > lower and <= upper
upper = lambda x: [gp for gp in x['Groups'] if gp >= x['Final']][0]  # or gp == max(x['Groups'])
lower = lambda x: [gp for gp in x['Groups'] if gp < x['Final'] or gp == min(x['Groups'])][-1]
GroupIndex = lambda x: [x['Groups'].index(gp) for gp in x['Groups'] if gp < x['Final'] or gp == min(x['Groups'])][-1]
f['Groups']=g.apply(groups, axis=1)
f['Upper'] = g.apply(upper, axis=1)
f['Lower'] = g.apply(lower, axis=1)
f['Group'] = g.apply(GroupIndex, axis=1)
f['Group']=f['Group']+1
This returns:
The list of group boundaries
The upper boundary pertinent to the value for "Final"
The lower boundary pertinent to the value for "Final"
The group to which the value for "Final" will belong based on logic noted in comments.
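For anyone who wants to check the boundary logic without pandas, here is a minimal standard-library sketch of the "> lower and <= upper" rule. The helper name classify is made up for illustration, and the breaks are hard-coded from the jenks_breaks output reported for group A in this thread rather than recomputed. One deliberate difference from the lambdas above: the group minimum is clamped into class 1, so its upper boundary is the second break rather than the minimum itself.

```python
from bisect import bisect_left


def classify(value, breaks):
    """Return (lower, upper, group) for value, given sorted class boundaries.

    Hypothetical helper: a value belongs to the class where
    lower < value <= upper, and the minimum is assigned to class 1.
    """
    # Index of the first boundary that is >= value.
    idx = bisect_left(breaks, value)
    # Clamp so the minimum lands in class 1 and nothing passes the last class.
    idx = min(max(idx, 1), len(breaks) - 1)
    return breaks[idx - 1], breaks[idx], idx


# Boundaries for group A, hard-coded here (assumption: taken from the
# jenks_breaks output quoted in this thread, not recomputed).
breaks_a = [1.0, 2.0, 3.0, 4.0, 6.0]

rows = [classify(v, breaks_a) for v in [1, 2, 3, 4, 5, 6]]
for value, (lower, upper, group) in zip([1, 2, 3, 4, 5, 6], rows):
    print(value, lower, upper, group)
```

Because the boundaries are sorted, a binary search does the same job as scanning the list in a comprehension, and the group number falls out of the same index.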
Your jenks lambda never uses its argument x: it always reads the whole f['Final'] column, so it returns the same breaks no matter which group apply or transform passes in. Changing the definition of jenks to
jenks = lambda x: jenkspy.jenks_breaks(x['Final'].tolist(),nb_class=4)
gives
In [315]: f.groupby(['BreakGroup']).apply(jenks)
Out[315]:
BreakGroup
A [1.0, 2.0, 3.0, 4.0, 6.0]
B [10.0, 20.0, 30.0, 40.0, 50.0]
dtype: object
Continuing from this redefinition,
g = f.set_index('BreakGroup')
g['Groups'] = f.groupby(['BreakGroup']).apply(jenks)
g.reset_index(inplace=True)
group = lambda x: [gp for gp in x['Groups'] if gp > x['Final'] or gp == max(x['Groups'])][0]
f['Result'] = g.apply(group, axis=1)
gives
In [323]: f
Out[323]:
BreakGroup Final Result
0 A 1 2.0
1 A 2 3.0
2 A 3 4.0
3 A 4 6.0
4 A 5 6.0
5 A 6 6.0
6 B 10 20.0
7 B 20 30.0
8 B 30 40.0
9 B 40 50.0
10 B 50 50.0
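If the row-wise apply gets slow on a larger frame, one possible alternative is to look up the upper boundary per group with numpy.searchsorted. This is only a sketch: upper_bounds and breaks_by_group are names invented here, and the breaks are hard-coded from the output shown above instead of calling jenkspy again. Note that it uses the "value <= upper" rule, so a value that sits exactly on a boundary keeps that boundary (2 maps to 2.0, 4 maps to 4.0), which differs from the Result column above.

```python
import numpy as np
import pandas as pd

f = pd.DataFrame({
    'BreakGroup': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
    'Final': [1, 2, 3, 4, 5, 6, 10, 20, 30, 40, 50],
})

# Assumption: per-group breaks hard-coded from the jenks_breaks output
# shown above, so this sketch runs without jenkspy installed.
breaks_by_group = {
    'A': [1.0, 2.0, 3.0, 4.0, 6.0],
    'B': [10.0, 20.0, 30.0, 40.0, 50.0],
}


def upper_bounds(values, breaks):
    """Hypothetical helper: upper class boundary for each value in a Series."""
    breaks = np.asarray(breaks, dtype=float)
    # First boundary >= value, for every value at once.
    idx = np.searchsorted(breaks, values.to_numpy(), side='left')
    # Clamp so the group minimum falls in the first class.
    idx = np.clip(idx, 1, len(breaks) - 1)
    return pd.Series(breaks[idx], index=values.index)


# Compute one piece per group; assignment re-aligns on the original index.
parts = [
    upper_bounds(s, breaks_by_group[key])
    for key, s in f.groupby('BreakGroup')['Final']
]
f['Upper'] = pd.concat(parts)
print(f)
```

Switching side='left' to side='right' would instead push boundary values into the next class, which is closer to the strict gp > x['Final'] comparison used in the answer.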