
Is cut() style binning available in dplyr?

Tags: sql, r, dplyr, binning

Is there a way to do something like the cut() function for binning numeric values in a dplyr table? I'm working on a large Postgres table and can currently either write a CASE statement in the SQL at the outset, or output unaggregated data and apply cut(). Both have pretty obvious downsides: CASE statements are not particularly elegant, and pulling a large number of records via collect() is not at all efficient.

Michael Williams asked Feb 11 '14 22:02



1 Answer

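Just so there's an immediate answer for others arriving here via a search engine: the closest thing to the n-breaks form of cut in dplyr is now the ntile function. Note that ntile splits the values by rank into n buckets of roughly equal size, whereas cut(x, n) splits the range of x into n equal-width intervals, so the two can assign different bins: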

> data.frame(x = c(5, 1, 3, 2, 2, 3)) %>% mutate(bin = ntile(x, 2))
  x bin
1 5   2
2 1   1
3 3   2
4 2   1
5 2   1
6 3   2
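To make the difference from cut() concrete, here is a minimal sketch on a local data frame using the same six values. It puts the equal-width bins from cut(), the equal-count bins from ntile(), and a hand-written case_when() binning next to each other. The column names (width_bin, count_bin, label_bin) and the cut-offs 2 and 4 are made up for illustration. On a remote table, dbplyr should translate ntile() to an NTILE window function and case_when() to a SQL CASE WHEN expression, so neither needs collect() first. Treat that as something to confirm with show_query() on your own backend rather than a guarantee.

```r
library(dplyr)

df <- data.frame(x = c(5, 1, 3, 2, 2, 3))

binned <- df %>%
  mutate(
    # Equal-width bins: the range 1..5 is split at 3, so only x = 5 lands in bin 2
    width_bin = cut(x, breaks = 2, labels = FALSE),
    # Equal-count bins: the three smallest values by rank go to bin 1, the rest to bin 2
    count_bin = ntile(x, 2),
    # Explicit cut-offs (hypothetical breaks at 2 and 4), written as case_when()
    label_bin = case_when(
      x <= 2 ~ "low",
      x <= 4 ~ "mid",
      TRUE   ~ "high"
    )
  )

print(binned)
#   x width_bin count_bin label_bin
# 1 5         2         2      high
# 2 1         1         1       low
# 3 3         1         2       mid
# 4 2         1         1       low
# 5 2         1         1       low
# 6 3         1         2       mid
```

The case_when() form is more verbose than cut(), but it keeps the bin edges explicit, which is usually what you want when the breaks have to match an existing SQL CASE statement.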
drhagen answered Sep 22 '22 12:09