Is there a way to do something like a cut() function for binning numeric values in a dplyr table? I'm working on a large Postgres table and can currently either write a CASE statement in the SQL at the outset, or output unaggregated data and apply cut(). Both have pretty obvious downsides: CASE statements are not particularly elegant, and pulling a large number of records via collect() is not at all efficient.
Binning is the process of transforming numerical or continuous data into categorical data. It is a common data pre-processing step in model building. The rbin package offers, among other features, manual binning through a Shiny app.
To create the bins for a continuous vector, we can use the cut function and store the bins in a data frame along with the original vector. The break points passed to cut must cover the full range of the vector's values; otherwise, any value that falls outside the breaks gets NA as its bin.
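As a minimal sketch of the point about break points, the following compares breaks that cover the whole vector with breaks that stop short. The vector x and the break values are made up for illustration.

```r
# Hypothetical example vector (not from the original question)
x <- c(5, 12, 27, 33, 48)

# Breaks spanning the full range of x: every value lands in an interval
covered <- cut(x, breaks = c(0, 10, 20, 30, 40, 50))

# Breaks that stop at 30: values above 30 match no interval and become NA
too_narrow <- cut(x, breaks = c(0, 10, 20, 30))

# Store the bins next to the original vector
binned <- data.frame(x = x, covered = covered, too_narrow = too_narrow)
print(binned)
```

By default cut builds right-closed intervals such as (0,10], so a value equal to the lowest break would also become NA unless include.lowest = TRUE is set.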
Grouping by a range of values is referred to as data binning or bucketing in data science, i.e., categorizing a number of continuous values into a smaller number of bins (buckets). Each bucket defines an interval, and a category name is assigned to each bucket.
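Assigning a category name to each bucket can be sketched with the labels argument of cut. The ages, break points, and label names below are illustrative assumptions, not values from the original posts.

```r
# Hypothetical ages to bucket
ages <- c(3, 17, 25, 64, 80)

# Four right-closed intervals: (0,12], (12,19], (19,64], (64,Inf]
# labels gives each interval a category name, in the same order as the intervals
age_group <- cut(
  ages,
  breaks = c(0, 12, 19, 64, Inf),
  labels = c("child", "teen", "adult", "senior")
)

# Count how many values fall in each named bucket
print(table(age_group))
```

Note that labels needs exactly one entry per interval, i.e. one fewer than the number of break points.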
Just so there's an immediate answer for others arriving here via a search engine: the n-bins use of cut is now covered by the ntile function in dplyr. Keep in mind that ntile assigns rows by rank to n groups of roughly equal size, whereas cut(x, n) produces n intervals of equal width.

> data.frame(x = c(5, 1, 3, 2, 2, 3)) %>% mutate(bin = ntile(x, 2))
  x bin
1 5   2
2 1   1
3 3   2
4 2   1
5 2   1
6 3   2
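For comparison, here is a small sketch that runs ntile and the n-breaks form of cut side by side on the same vector used in the answer. The column names bin_ntile and bin_cut are hypothetical, and the snippet assumes the dplyr package is installed.

```r
suppressPackageStartupMessages(library(dplyr))

df <- data.frame(x = c(5, 1, 3, 2, 2, 3))

result <- df %>%
  mutate(
    # ntile: rank-based split into 2 groups with (roughly) equal row counts
    bin_ntile = ntile(x, 2),
    # cut with a single number: 2 intervals of equal width over range(x)
    bin_cut = as.integer(cut(x, 2))
  )

print(result)
```

On this data the two columns disagree for the rows where x is 3: ntile places them in the upper half by rank, while cut keeps them in the lower interval because 3 is the midpoint of the range and the intervals are right-closed.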