How to map numeric data into categories / bins in Pandas dataframe

Q: How do you split data into bins in Python?

Use pd. cut() for binning data based on the range of possible values. Use pd. qcut() for binning data based on the actual distribution of values.

Q: How do I create a category in Pandas?

How to add new categories to a Pandas categorical column? Pass the category or a list of categories (if adding multiple categories) as an argument to the function. The additional categories are added to the list of possible category values for that field.

Tags:

python

pandas

dataframe

numpy

python-2.7

I've just started coding in python, and my general coding skills are fairly rusty :( so please be a bit patient

I have a pandas dataframe:

SamplePandas

It has around 3m rows. There are 3 kinds of age_units: Y, D, W for years, Days & Weeks. Any individual over 1 year old has an age unit of Y and my first grouping I want is <2y old so all I have to test for in Age Units is Y...

I want to create a new column AgeRange and populate with the following ranges:

<2
2 - 18
18 - 35
35 - 65
65+

so I wrote a function

def agerange(values):     for i in values:         if complete.Age_units == 'Y':             if complete.Age > 1 AND < 18 return '2-18'             elif complete.Age > 17 AND < 35 return '18-35'             elif complete.Age > 34 AND < 65 return '35-65'             elif complete.Age > 64 return '65+'         else return '< 2'

I thought if I passed in the dataframe as a whole I would get back what I needed and then could create the column I wanted something like this:

agedetails['age_range'] = ageRange(agedetails)

BUT when I try to run the first code to create the function I get:

  File "<ipython-input-124-cf39c7ce66d9>", line 4     if complete.Age > 1 AND complete.Age < 18 return '2-18'                           ^ SyntaxError: invalid syntax

Clearly it is not accepting the AND - but I thought I heard in class I could use AND like this? I must be mistaken but then what would be the right way to do this?

So after getting that error, I'm not even sure the method of passing in a dataframe will throw an error either. I am guessing probably yes. In which case - how would I make that work as well?

I am looking to learn the best method, but part of the best method for me is keeping it simple even if that means doing things in a couple of steps...

521

asked Mar 20 '18 10:03

kiltannen

1 Answers

With Pandas, you should avoid row-wise operations, as these usually involve an inefficient Python-level loop. Here are a couple of alternatives.

Pandas: `pd.cut`

As @JonClements suggests, you can use pd.cut for this, the benefit here being that your new column becomes a Categorical.

You only need to define your boundaries (including np.inf) and category names, then apply pd.cut to the desired numeric column.

bins = [0, 2, 18, 35, 65, np.inf] names = ['<2', '2-18', '18-35', '35-65', '65+']  df['AgeRange'] = pd.cut(df['Age'], bins, labels=names)  print(df.dtypes)  # Age             int64 # Age_units      object # AgeRange     category # dtype: object

NumPy: `np.digitize`

np.digitize provides another clean solution. The idea is to define your boundaries and names, create a dictionary, then apply np.digitize to your Age column. Finally, use your dictionary to map your category names.

Note that for boundary cases the lower bound is used for mapping to a bin.

import pandas as pd, numpy as np  df = pd.DataFrame({'Age': [99, 53, 71, 84, 84],                    'Age_units': ['Y', 'Y', 'Y', 'Y', 'Y']})  bins = [0, 2, 18, 35, 65] names = ['<2', '2-18', '18-35', '35-65', '65+']  d = dict(enumerate(names, 1))  df['AgeRange'] = np.vectorize(d.get)(np.digitize(df['Age'], bins))

Result

   Age Age_units AgeRange 0   99         Y      65+ 1   53         Y    35-65 2   71         Y      65+ 3   84         Y      65+ 4   84         Y      65+

answered Sep 18 '22 01:09

jpp

Related questions
                            
                                Default message in custom exception - Python
                            
                                ModuleNotFoundError: No module named 'sklearn.externals.six'
                            
                                Is there a way to decode numerical COM error-codes in pywin32
                            
                                Python xlwt - accessing existing cell content, auto-adjust column width
                            
                                dict.get() - default arg evaluated even upon success
                            
                                Use "contains" and "iexact" at the same query in DJANGO
                            
                                Static classes in Python
                            
                                How can I change the font size of ticks of axes object in matplotlib [duplicate]
                            
                                Finding consecutive segments in a pandas data frame
                            
                                Recursive os.listdir? [duplicate]
                            
                                How to find linearly independent rows from a matrix
                            
                                Uploading a file to a S3 bucket with a prefix using Boto3
                            
                                Django Vote Up/Down method [closed]
                            
                                Limiting Memory Use in a *Large* Django QuerySet
                            
                                Which command to use for checking whether python is 64bit or 32bit
                            
                                predict_proba for a cross-validated model
                            
                                pandas: how do I select first row in each GROUP BY group?
                            
                                What's the difference between list() and [] [duplicate]
                            
                                How do I avoid KeyError when working with dictionaries?
                            
                                RNN Regularization: Which Component to Regularize?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to map numeric data into categories / bins in Pandas dataframe

Tags:

python

pandas

dataframe

numpy

python-2.7

kiltannen

People also ask

1 Answers

Pandas: `pd.cut`

NumPy: `np.digitize`

Result

jpp

Recent Activity

Donate For Us

How to map numeric data into categories / bins in Pandas dataframe

Tags:

python

pandas

dataframe

numpy

python-2.7

kiltannen

People also ask

1 Answers

Pandas: pd.cut

NumPy: np.digitize

Result

jpp

Related questions

Recent Activity

Donate For Us

Pandas: `pd.cut`

NumPy: `np.digitize`