Pandas how to use pd.cut()

Tags:

python

pandas

Here is the snippet:

import pandas as pd

test = pd.DataFrame({'days': [0,31,45]})
test['range'] = pd.cut(test.days, [0,30,60])

Output:

    days    range
0   0       NaN
1   31      (30, 60]
2   45      (30, 60]

I am surprised that 0 is not in (0, 30]. What should I do to categorize 0 as (0, 30]?

Cheng asked Aug 18 '17 07:08



4 Answers

test['range'] = pd.cut(test.days, [0,30,60], include_lowest=True)
print(test)

   days           range
0     0  (-0.001, 30.0]
1    31    (30.0, 60.0]
2    45    (30.0, 60.0]

See difference:

test = pd.DataFrame({'days': [0,20,30,31,45,60]})

test['range1'] = pd.cut(test.days, [0,30,60], include_lowest=True)
# with right=False, the value 30 is in the [30, 60) group
test['range2'] = pd.cut(test.days, [0,30,60], right=False)
# with the defaults, the value 30 is in the (0, 30] group
test['range3'] = pd.cut(test.days, [0,30,60])
print(test)

   days          range1    range2    range3
0     0  (-0.001, 30.0]   [0, 30)       NaN
1    20  (-0.001, 30.0]   [0, 30)   (0, 30]
2    30  (-0.001, 30.0]  [30, 60)   (0, 30]
3    31    (30.0, 60.0]  [30, 60)  (30, 60]
4    45    (30.0, 60.0]  [30, 60)  (30, 60]
5    60    (30.0, 60.0]       NaN  (30, 60]

Or use numpy.searchsorted, which returns integer bin positions instead of intervals. The array of bin edges has to be sorted; the values of days do not:

import numpy as np

arr = np.array([0,30,60])
test['range1'] = arr.searchsorted(test.days)
test['range2'] = arr.searchsorted(test.days, side='right') - 1
print(test)

   days  range1  range2
0     0       0       0
1    20       1       0
2    30       1       1
3    31       2       1
4    45       2       1
5    60       2       2
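As a rough sketch of how the searchsorted positions relate to pd.cut, the snippet below runs both on deliberately unsorted values. The names edges, days, idx and codes are made up for this illustration and are not from the answer above.

```python
import numpy as np
import pandas as pd

edges = np.array([0, 30, 60])               # bin edges, sorted ascending
days = pd.Series([45, 0, 60, 20, 31, 30])   # input values, deliberately unsorted

# side='right' minus 1 gives the position of the left-closed bin
# [edges[i], edges[i+1]) that each value falls into.
idx = edges.searchsorted(days.to_numpy(), side='right') - 1

# pd.cut with right=False and labels=False yields the same positions,
# except that the top edge (60) lies outside the last bin and becomes NaN.
codes = pd.cut(days, edges, right=False, labels=False)

print(pd.DataFrame({'days': days, 'searchsorted': idx, 'cut': codes}))
```

The main difference is at the boundaries: searchsorted always returns a position, even for values at or beyond the last edge, while pd.cut marks values outside every bin as NaN.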
jezrael answered Oct 03 '22 10:10


See the pd.cut documentation.
Pass the parameter right=False so that each bin is closed on the left and open on the right:

test = pd.DataFrame({'days': [0,31,45]})
test['range'] = pd.cut(test.days, [0,30,60], right=False)

test

   days     range
0     0   [0, 30)
1    31  [30, 60)
2    45  [30, 60)
piRSquared answered Oct 03 '22 11:10


You can also pass labels to pd.cut(). The following example uses student grades in the range 0-10 and adds a new column called 'grade_cat' to categorize them.

bins gives the interval edges: with the default right=True, [0, 4, 6, 10] produces the intervals (0, 4], (4, 6] and (6, 10]. The corresponding labels are "poor", "normal" and "excellent". As in the question, a grade of exactly 0 falls outside (0, 4] unless include_lowest=True is passed.

bins = [0, 4, 6, 10]
labels = ["poor","normal","excellent"]
student['grade_cat'] = pd.cut(student['grade'], bins=bins, labels=labels)
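The snippet above does not show the student frame, so here is a self-contained sketch with made-up data. The column values and the extra grade_cat_default column are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data; the original answer does not show the student frame.
student = pd.DataFrame({'name': ['A', 'B', 'C', 'D'], 'grade': [0, 4, 5, 9]})

bins = [0, 4, 6, 10]
labels = ['poor', 'normal', 'excellent']

# Default behaviour: the lowest edge is excluded, so a grade of 0 gets NaN.
student['grade_cat_default'] = pd.cut(student['grade'], bins=bins, labels=labels)

# include_lowest=True makes the first interval include its left edge.
student['grade_cat'] = pd.cut(student['grade'], bins=bins, labels=labels,
                              include_lowest=True)

print(student)
```

With labels supplied, the interval notation never appears in the result, so the only visible effect of include_lowest=True is that the grade of 0 is labeled instead of left missing.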
Mino De Raj answered Oct 03 '22 11:10


A sample of how pd.cut works:

import pandas as pd

s = pd.Series([168,180,174,190,170,185,179,181,175,169,182,177,180,171])
# Split the values into 3 bins
pd.cut(s, 3)
# To add labels to the bins
pd.cut(s, 3, labels=["Small","Medium","Large"])

Passing an integer for bins splits the range of the values into that many equal-width bins, so no explicit edges are needed.
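To see where pd.cut places the edges when given an integer bin count, it can return them through retbins=True. A small sketch on the same series; the names cats, edges and counts are my own.

```python
import pandas as pd

s = pd.Series([168, 180, 174, 190, 170, 185, 179, 181, 175, 169, 182, 177, 180, 171])

# retbins=True returns the computed bin edges alongside the categories.
cats, edges = pd.cut(s, 3, retbins=True)

print(edges)  # 4 edges for 3 bins; the lowest is nudged just below min(s)

# Count how many values landed in each bin, in bin order.
counts = [int((cats.cat.codes == i).sum()) for i in range(3)]
print(counts)
```

The lowest edge is pushed slightly below the minimum so that the smallest value is included in the first right-closed interval, which avoids the NaN seen in the question.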

nashtgc answered Oct 03 '22 11:10