
quantile cut by group in data.table

I would like to do quantile cuts (cut into n bins with an equal number of points in each) separately for each group.

qcut = function(x, n) {
  quantiles = seq(0, 1, length.out = n+1)
  cutpoints = unname(quantile(x, quantiles, na.rm = TRUE))
  cut(x, cutpoints, include.lowest = TRUE)
}

library(data.table)
dt = data.table(A = 1:10, B = c(1,1,1,1,1,2,2,2,2,2))
dt[, bin := qcut(A, 3)]
dt[, bin2 := qcut(A, 3), by = B]

dt
     A B    bin        bin2
 1:  1 1  [1,4]    [6,7.33]
 2:  2 1  [1,4]    [6,7.33]
 3:  3 1  [1,4] (7.33,8.67]
 4:  4 1  [1,4]   (8.67,10]
 5:  5 1  (4,7]   (8.67,10]
 6:  6 2  (4,7]    [6,7.33]
 7:  7 2  (4,7]    [6,7.33]
 8:  8 2 (7,10] (7.33,8.67]
 9:  9 2 (7,10]   (8.67,10]
10: 10 2 (7,10]   (8.67,10]

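Here the cut without grouping (bin) is correct: every value lies inside its bin. But the result by group (bin2) is wrong: the rows in group B = 1, where A runs from 1 to 5, are labelled with intervals that only cover the values of group B = 2.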

How can I fix that?

jf328 asked Mar 22 '17 10:03



1 Answer

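This is a bug in the handling of factors: each group's call to cut() returns a factor with its own set of levels, and the grouped assignment does not appear to combine them correctly. Please check whether it is already known (or fixed in the development version) and, if not, report it to the data.table bug tracker. As a workaround, return a character vector instead of a factor: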

qcut = function(x, n) {
  quantiles = seq(0, 1, length.out = n+1)
  cutpoints = unname(quantile(x, quantiles, na.rm = TRUE))
  # return character labels rather than a factor, so no levels need to be merged across groups
  as.character(cut(x, cutpoints, include.lowest = TRUE))
}

dt[, bin2 := qcut(A, 3), by = B]
#     A B    bin        bin2
# 1:  1 1  [1,4]    [1,2.33]
# 2:  2 1  [1,4]    [1,2.33]
# 3:  3 1  [1,4] (2.33,3.67]
# 4:  4 1  [1,4]    (3.67,5]
# 5:  5 1  (4,7]    (3.67,5]
# 6:  6 2  (4,7]    [6,7.33]
# 7:  7 2  (4,7]    [6,7.33]
# 8:  8 2 (7,10] (7.33,8.67]
# 9:  9 2 (7,10]   (8.67,10]
#10: 10 2 (7,10]   (8.67,10]
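
If only the bin number is needed rather than the interval label, another way to avoid factors is to have cut() return integer codes. The following is a minimal sketch, not part of the original answer; `qcut_id` and the column name `bin_id` are hypothetical names introduced for illustration.

```r
library(data.table)

# Hypothetical helper: returns the bin number (1..n) instead of an interval label
qcut_id = function(x, n) {
  quantiles = seq(0, 1, length.out = n + 1)
  cutpoints = unname(quantile(x, quantiles, na.rm = TRUE))
  # labels = FALSE makes cut() return integer codes rather than a factor
  cut(x, cutpoints, include.lowest = TRUE, labels = FALSE)
}

dt = data.table(A = 1:10, B = c(1,1,1,1,1,2,2,2,2,2))

# Bin numbers are computed within each group of B
dt[, bin_id := qcut_id(A, 3), by = B]
```

The bin numbers restart in every group, so `bin_id` is only meaningful together with `B`. The interval boundaries are not stored; if they are needed, the character version above keeps them.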
Roland answered Sep 17 '22 19:09