
Applying pandas qcut bins to new data

Tags: python, pandas

I am using pandas qcut to split some data into 20 bins as part of data preparation for training a binary classification model, like so:

data['VAR_BIN'] = pd.qcut(cc_data[var], 20, labels=False)

My question is: how can I apply the same binning logic derived from the qcut statement above to a new set of data, say for model validation purposes? Is there an easy way to do this?

Thanks

asked Jun 19 '16 10:06 by GRN


1 Answer

You can do it by passing retbins=True.

Consider the following DataFrame:

import pandas as pd
import numpy as np
prng = np.random.RandomState(0)
df = pd.DataFrame(prng.randn(100, 2), columns = ["A", "B"])

pd.qcut(df["A"], 20, retbins=True, labels=False) returns a tuple whose second element is the bins. So you can do:

ser, bins = pd.qcut(df["A"], 20, retbins=True, labels=False)

ser is the series of integer bin indices (integers rather than categories, because labels=False) and bins is the array of break points. Now you can pass bins to pd.cut to apply the same grouping to the other column:

pd.cut(df["B"], bins=bins, labels=False, include_lowest=True)
Out[38]: 
0     13
1     19
2      3
3      9
4     13
5     17
...
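
One caveat when reusing the break points on genuinely new data: pd.cut returns NaN for any value that falls outside the range spanned by bins. A minimal sketch of one way to handle that is below, assuming you are happy for unseen extremes to fall into the first or last bin. The frames train and valid, the column x, and the random data are hypothetical stand-ins for your training and validation sets, not something from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical training and validation samples (names and data are illustrative).
rng = np.random.RandomState(0)
train = pd.DataFrame({"x": rng.randn(1000)})
# Wider spread than the training data, so some values may fall outside its range.
valid = pd.DataFrame({"x": rng.randn(200) * 1.5})

# Learn the 20 quantile-based break points on the training data only.
train["x_bin"], edges = pd.qcut(train["x"], 20, labels=False, retbins=True)

# Widen the outermost edges so unseen extremes land in the first or last bin
# instead of becoming NaN. This is a modelling choice, not something qcut does.
open_edges = edges.copy()
open_edges[0] = -np.inf
open_edges[-1] = np.inf

# Apply the training break points to the validation data.
valid["x_bin"] = pd.cut(valid["x"], bins=open_edges, labels=False, include_lowest=True)
```

If you would rather flag out-of-range values than absorb them, pass the unmodified edges to pd.cut and treat the resulting NaN rows explicitly.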
answered Sep 30 '22 03:09 by ayhan