
Difference between ntile and cut and then quantile() function in R

Tags: r, dplyr

I found two threads on calculating deciles in R. However, the two methods, dplyr::ntile() and cut() with quantile(), yield different output. In fact, dplyr::ntile() fails to output proper deciles.

Method 1: using ntile(). Following the thread "R: splitting dataset into quartiles/deciles. What is the right method?", we could use ntile().

Here's my code:

vector<-c(0.0242034679584454, 0.0240411606258083, 0.00519255930109344, 
  0.00948031338483081, 0.000549450549450549, 0.085972850678733, 
  0.00231687756193192, NA, 0.1131625967838, 0.00539244534707915, 
  0.0604885614579294, 0.0352030947775629, 0.00935626135385923, 
  0.401201201201201, 0.0208212839791787, NA, 0.0462887301644538, 
  0.0224952741020794, NA, NA, 0.000984952654008562)

library(dplyr)

ntile(vector, 10)

The output is:

5  5  2  3  1  7  1 NA  8  2  7  6  3  8  4 NA  6  4 NA NA  1

If we analyze this, we see that no value is assigned to the 10th decile (nor to the 9th)!

Method 2: using quantile(). Now, let's use the method from the thread "How to quickly form groups (quartiles, deciles, etc) by ordering column(s) in a data frame".

Here's my code:

as.numeric(cut(vector,
               breaks = quantile(vector, probs = seq(0, 1, length.out = 11), na.rm = TRUE),
               include.lowest = TRUE))

The output is:

 7  6  2  4  1  9  2 NA 10  3  9  7  4 10  5 NA  8  5 NA NA  1

As we can see, the outputs are completely different. What am I missing here? I'd appreciate any thoughts.

Is this a bug in ntile() function?

asked Jan 14 '17 18:01 by watchtower



1 Answer

In dplyr::ntile(), NA is always ranked last (highest rank), so the four NAs take up the top of the ranking, and that is why no value lands in the 10th decile in this case. If you want the deciles to ignore NAs, you can define a wrapper function such as the following, which I use next:

ntile_na <- function(x,n)
{
  notna <- !is.na(x)
  out <- rep(NA_real_,length(x))
  out[notna] <- ntile(x[notna],n)
  return(out)
}

ntile_na(vector, 10)
# [1]  6  6  2  4  1  9  2 NA  9  3  8  7  3 10  5 NA  8  5 NA NA  1
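
To see why counting the NAs matters, here is a minimal base R sketch that reproduces both outputs above. The helper name rank_bucket and its bucketing formula are assumptions chosen because they reproduce these numbers; they are not taken from dplyr's source, and other dplyr versions may behave differently.

```r
# Data from the question (17 observed values, 4 NAs).
x <- c(0.0242034679584454, 0.0240411606258083, 0.00519255930109344,
       0.00948031338483081, 0.000549450549450549, 0.085972850678733,
       0.00231687756193192, NA, 0.1131625967838, 0.00539244534707915,
       0.0604885614579294, 0.0352030947775629, 0.00935626135385923,
       0.401201201201201, 0.0208212839791787, NA, 0.0462887301644538,
       0.0224952741020794, NA, NA, 0.000984952654008562)

# Hypothetical helper (an assumption, not dplyr's code): map the ranks of
# the observed values onto buckets 1..n, where `total` is the number of
# slots the ranks are spread over.
rank_bucket <- function(x, n, total) {
  r <- rank(x, ties.method = "first", na.last = "keep")  # NA stays NA
  floor(n * (r - 1) / total + 1)
}

# Spread over all 21 slots (NAs counted): the 17 observed values
# cannot reach the top buckets.
with_na <- rank_bucket(x, 10, total = length(x))

# Spread over the 17 observed values only: all 10 buckets are used.
without_na <- rank_bucket(x, 10, total = sum(!is.na(x)))

print(with_na)
print(without_na)
```

With the NAs counted, the largest observed value has rank 17 out of 21 slots, so its bucket is floor(10 * 16 / 21 + 1) = 8. With the NAs dropped it is floor(10 * 16 / 17 + 1) = 10.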

Also, quantile() has nine ways of computing quantiles. You are using the default, which is type 7 (see ?stats::quantile for a description of the different types).

If you try

as.numeric(cut(vector, 
               breaks = quantile(vector, 
                                 probs = seq(0, 1, length = 11), 
                                 na.rm = TRUE,
                                 type = 2),
               include.lowest = TRUE))
# [1]  6  6  2  4  1  9  2 NA  9  3  8  7  3 10  5 NA  8  5 NA NA  1

you get the same result as with ntile_na().
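
As a rough illustration of where the two quantile types diverge, the sketch below prints the decile break points for type 7 and type 2 side by side and lists the positions whose bin changes. It uses only base R; the variable names are made up for this example.

```r
# Data from the question (17 observed values, 4 NAs).
x <- c(0.0242034679584454, 0.0240411606258083, 0.00519255930109344,
       0.00948031338483081, 0.000549450549450549, 0.085972850678733,
       0.00231687756193192, NA, 0.1131625967838, 0.00539244534707915,
       0.0604885614579294, 0.0352030947775629, 0.00935626135385923,
       0.401201201201201, 0.0208212839791787, NA, 0.0462887301644538,
       0.0224952741020794, NA, NA, 0.000984952654008562)

probs <- seq(0, 1, length.out = 11)

# Type 7 (the default) interpolates linearly between order statistics;
# type 2 picks an observed value, averaging only at exact discontinuities.
breaks_type7 <- quantile(x, probs = probs, na.rm = TRUE, type = 7)
breaks_type2 <- quantile(x, probs = probs, na.rm = TRUE, type = 2)

# Break points side by side.
print(cbind(type7 = breaks_type7, type2 = breaks_type2))

# Bin every value with each set of breaks.
bins_type7 <- as.numeric(cut(x, breaks = breaks_type7, include.lowest = TRUE))
bins_type2 <- as.numeric(cut(x, breaks = breaks_type2, include.lowest = TRUE))

# Positions whose decile differs between the two types.
print(which(bins_type7 != bins_type2))
```

Only the interior break points move between the two types, so the values that change bin are the ones sitting close to a break.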

In summary: it is not a bug. The two approaches are simply implemented differently.

answered Oct 21 '22 15:10 by DAVL