I have a pandas dataframe which contains data as shown below:
ID year_month_id Class
1 201612 A
2 201612 D
3 201612 B
4 201612 Other
5 201612 Other
6 201612 Other
7 201612 A
8 201612 Other
9 201612 A
1 201701 B
So an ID can be under any class in a particular month and next month his class might change. Now what I want to do is for each ID get the number of months it has been under a particular class and also the latest class under which it falls. Something like below:
ID Class_A Class_B Class_D Other Latest_Class
1 2 3 4 0 B
2 12 0 0 0 D
How do I achieve this in python. Can someone please help me with this? Also , since the real dataset is huge and manually verifying is not possible, how can I get a list of ID's which fall under more than 1 class?
We can use pivot table and concat i.e
ndf = df.pivot_table(index=['ID'],columns=['Class'],aggfunc='count',fill_value=0)\
.xs('year_month_id', axis=1, drop_level=True)
ndf['latest'] = df.sort_values('ID').groupby('ID')['Class'].tail(1).values
Class A B D Other latest
ID
1 1 1 0 0 B
2 0 0 1 0 D
3 0 1 0 0 B
4 0 0 0 1 Other
5 0 0 0 1 Other
6 0 0 0 1 Other
7 1 0 0 0 A
8 0 0 0 1 Other
9 1 0 0 0 A
You can get counts by groupby
with aggregate count
, reshape by unstack
. Last add new column with drop_duplicates
:
df1 = df.groupby(['ID','Class'])['year_month_id'].count().unstack(fill_value=0)
df1['Latest_Class'] = df.drop_duplicates('ID', keep='last').set_index('ID')['Class']
print (df1)
Class A B D Other Latest_Class
ID
1 1 1 0 0 B
2 0 0 1 0 D
3 0 1 0 0 B
4 0 0 0 1 Other
5 0 0 0 1 Other
6 0 0 0 1 Other
7 1 0 0 0 A
8 0 0 0 1 Other
9 1 0 0 0 A
You can get a count of classes attended with groupby
+ value_counts
+ unstack
-
g = df.groupby('ID')
i = g.Class.value_counts().unstack(fill_value=0)
To get the last Class, use groupby
+ last
-
j = g.Class.last()
Concatenate to get your result -
pd.concat([i, j], 1).rename(columns={'Class': 'LastClass'})
A B D Other LastClass
ID
1 1 1 0 0 B
2 0 0 1 0 D
3 0 1 0 0 B
4 0 0 0 1 Other
5 0 0 0 1 Other
6 0 0 0 1 Other
7 1 0 0 0 A
8 0 0 0 1 Other
9 1 0 0 0 A
To get a list of IDs which have more than 1 per row, use sum
+ a mask -
k = i.sum(axis=1)
k[k > 1]
ID
1 2
dtype: int64
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With