I have a DataFrame, say df, which has a column 'Age':
>>> df['Age']
I want to group these ages and create a new column, something like this:
If age >= 0 & age < 2 then AgeGroup = Infant
If age >= 2 & age < 4 then AgeGroup = Toddler
If age >= 4 & age < 13 then AgeGroup = Kid
If age >= 13 & age < 20 then AgeGroup = Teen
and so on.
How can I achieve this using the pandas library?
I tried something like this:
X_train_data['AgeGroup'][ X_train_data.Age < 13 ] = 'Kid'
X_train_data['AgeGroup'][ X_train_data.Age < 3 ] = 'Toddler'
X_train_data['AgeGroup'][ X_train_data.Age < 1 ] = 'Infant'
but doing this I get the following warning:
/Users/Anand/miniconda3/envs/learn/lib/python3.7/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
/Users/Anand/miniconda3/envs/learn/lib/python3.7/site-packages/ipykernel_launcher.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
How can I avoid this warning and do this in a better way?
Use pandas.cut with the parameter right=False so that the rightmost edge of each bin is excluded:
import pandas as pd
X_train_data = pd.DataFrame({'Age': [0, 2, 4, 13, 35, -1, 54]})
bins = [0, 2, 4, 13, 20, 110]
labels = ['Infant', 'Toddler', 'Kid', 'Teen', 'Adult']
X_train_data['AgeGroup'] = pd.cut(X_train_data['Age'], bins=bins, labels=labels, right=False)
print(X_train_data)
   Age AgeGroup
0    0   Infant
1    2  Toddler
2    4      Kid
3   13     Teen
4   35    Adult
5   -1      NaN
6   54    Adult
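To see why right=False matters here, the following minimal sketch compares it with the default right=True on a few boundary ages. The sample ages and the variable names right_closed and left_closed are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Illustrative ages that sit exactly on the bin edges
ages = pd.Series([0, 2, 4, 13, 19, 20])
bins = [0, 2, 4, 13, 20, 110]
labels = ['Infant', 'Toddler', 'Kid', 'Teen', 'Adult']

# Default right=True: intervals are (0, 2], (2, 4], ...
# so age 0 falls outside every bin and age 2 is still labelled Infant
right_closed = pd.cut(ages, bins=bins, labels=labels)

# right=False: intervals are [0, 2), [2, 4), ...
# which matches the "age >= 2 and age < 4" rules in the question
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

comparison = pd.DataFrame({
    'Age': ages,
    'right=True': right_closed,
    'right=False': left_closed,
})
print(comparison)
```

With the default setting, an age of exactly 20 would still be a Teen, while with right=False it moves to Adult, which is what the rules in the question ask for.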
Finally, to replace the missing values, use add_categories with fillna:
X_train_data['AgeGroup'] = X_train_data['AgeGroup'].cat.add_categories('unknown').fillna('unknown')
print(X_train_data)
   Age AgeGroup
0    0   Infant
1    2  Toddler
2    4      Kid
3   13     Teen
4   35    Adult
5   -1  unknown
6   54    Adult
Alternatively, add a bin that covers -1 and label it 'unknown':
bins = [-1, 0, 2, 4, 13, 20, 110]
labels = ['unknown', 'Infant', 'Toddler', 'Kid', 'Teen', 'Adult']
X_train_data['AgeGroup'] = pd.cut(X_train_data['Age'], bins=bins, labels=labels, right=False)
print(X_train_data)
   Age AgeGroup
0    0   Infant
1    2  Toddler
2    4      Kid
3   13     Teen
4   35    Adult
5   -1  unknown
6   54    Adult
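Note that with the bins above, any age below -1 or at or above 110 would still come out as NaN. One possible variation, sketched here as an assumption rather than as part of the original answer, is to use infinite outer edges so every numeric age lands in some bin. The sample ages and the names open_bins and grouped are made up for illustration.

```python
import pandas as pd

# Illustrative ages, including values outside the original bin range
ages = pd.Series([-5, -1, 0, 3, 15, 54, 120])

# Infinite outer edges: everything below 0 is 'unknown', everything from 20 up is 'Adult'
open_bins = [float('-inf'), 0, 2, 4, 13, 20, float('inf')]
labels = ['unknown', 'Infant', 'Toddler', 'Kid', 'Teen', 'Adult']

# right=False keeps the left-closed intervals, e.g. [13, 20) for Teen
grouped = pd.cut(ages, bins=open_bins, labels=labels, right=False)
print(pd.DataFrame({'Age': ages, 'AgeGroup': grouped}))
```

Whether an age of 120 should count as Adult or be flagged as bad data depends on the dataset, so treat the open upper edge as a design choice.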
To avoid the warning, select the rows and the column in a single .loc call:
X_train_data.loc[(X_train_data.Age < 13), 'AgeGroup'] = 'Kid'
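Extending that idea to all of the groups from the question could look like the sketch below. The 'unknown' default and the 'Adult' label for ages of 20 and over are assumptions borrowed from the pandas.cut answer, not something the question specifies. Each comparison needs its own parentheses because & binds more tightly than >= and < in Python.

```python
import pandas as pd

X_train_data = pd.DataFrame({'Age': [0, 2, 4, 13, 35, -1, 54]})

# Start from a default label so rows that match no rule (e.g. negative ages) are explicit
X_train_data['AgeGroup'] = 'unknown'

# Each assignment picks the rows and the target column in one .loc call,
# so the write goes to X_train_data itself and no SettingWithCopyWarning is raised
age = X_train_data['Age']
X_train_data.loc[(age >= 0) & (age < 2), 'AgeGroup'] = 'Infant'
X_train_data.loc[(age >= 2) & (age < 4), 'AgeGroup'] = 'Toddler'
X_train_data.loc[(age >= 4) & (age < 13), 'AgeGroup'] = 'Kid'
X_train_data.loc[(age >= 13) & (age < 20), 'AgeGroup'] = 'Teen'
X_train_data.loc[age >= 20, 'AgeGroup'] = 'Adult'

print(X_train_data)
```

Because every rule has both a lower and an upper bound, the order of the assignments does not matter here, unlike the chained version in the question, where later lines overwrite earlier ones.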