Difference between ntile and cut and then quantile() function in R

Tags: r, dplyr

I found two threads on calculating deciles in R. However, the two methods, dplyr::ntile() and quantile(), yield different output. In fact, dplyr::ntile() fails to output proper deciles.

Method 1: using ntile(). From the thread "R: splitting dataset into quartiles/deciles. What is the right method?", we could use ntile().

Here's my code:

vector<-c(0.0242034679584454, 0.0240411606258083, 0.00519255930109344, 
  0.00948031338483081, 0.000549450549450549, 0.085972850678733, 
  0.00231687756193192, NA, 0.1131625967838, 0.00539244534707915, 
  0.0604885614579294, 0.0352030947775629, 0.00935626135385923, 
  0.401201201201201, 0.0208212839791787, NA, 0.0462887301644538, 
  0.0224952741020794, NA, NA, 0.000984952654008562)

library(dplyr)  # provides ntile()
ntile(vector, 10)

The output is:

ntile(vector,10)
5  5  2  3  1  7  1 NA  8  2  7  6  3  8  4 NA  6  4 NA NA  1

Looking at this, the highest group is 8: there is no 9th or 10th decile!

Method 2: using quantile(). Now, let's use the method from the thread "How to quickly form groups (quartiles, deciles, etc) by ordering column(s) in a data frame".

Here's my code:

as.numeric(cut(vector, breaks = quantile(vector, probs = seq(0, 1, length.out = 11), na.rm = TRUE), include.lowest = TRUE))

The output is:

 7  6  2  4  1  9  2 NA 10  3  9  7  4 10  5 NA  8  5 NA NA  1

As we can see, the outputs are completely different. What am I missing here? I'd appreciate any thoughts.

Is this a bug in the ntile() function?

asked Jan 14 '17 18:01

watchtower

1 Answer

dplyr::ntile() treats NA as always ranking last (highest), so the NAs take up the top groups and no non-missing value reaches the 10th decile in this case. If you want the deciles to ignore NAs, you can define a function like this one:

ntile_na <- function(x,n)
{
  notna <- !is.na(x)
  out <- rep(NA_real_,length(x))
  out[notna] <- ntile(x[notna],n)
  return(out)
}

ntile_na(vector, 10)
# [1]  6  6  2  4  1  9  2 NA  9  3  8  7  3 10  5 NA  8  5 NA NA  1
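
To see where the missing top deciles come from, here is a minimal base-R sketch of rank-based bucketing. It assumes the group is computed as floor(n * (rank - 1) / len + 1), which reproduces both outputs in this thread but is not necessarily what your installed dplyr version does. The names bucket_sketch and count_na are hypothetical, introduced only for illustration.

```r
# Sample data from the question (named x here to avoid masking base::vector).
x <- c(0.0242034679584454, 0.0240411606258083, 0.00519255930109344,
       0.00948031338483081, 0.000549450549450549, 0.085972850678733,
       0.00231687756193192, NA, 0.1131625967838, 0.00539244534707915,
       0.0604885614579294, 0.0352030947775629, 0.00935626135385923,
       0.401201201201201, 0.0208212839791787, NA, 0.0462887301644538,
       0.0224952741020794, NA, NA, 0.000984952654008562)

# Hypothetical helper: rank-based bucketing in base R.
# count_na = TRUE sizes the groups using length(x), NAs included;
# count_na = FALSE sizes them using only the non-missing values.
bucket_sketch <- function(x, n, count_na = TRUE) {
  r <- rank(x, ties.method = "first", na.last = "keep")  # NAs stay NA
  len <- if (count_na) length(x) else sum(!is.na(x))
  floor(n * (r - 1) / len + 1)
}

with_na    <- bucket_sketch(x, 10, count_na = TRUE)
without_na <- bucket_sketch(x, 10, count_na = FALSE)

max(with_na, na.rm = TRUE)     # 8: the denominator is 21 (17 values + 4 NAs)
max(without_na, na.rm = TRUE)  # 10: the denominator is 17
```

With the NAs counted, the largest of the 17 non-missing values can only reach group 8, which matches the output in the question. With the NAs excluded, the result matches ntile_na(vector, 10).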

Also, quantile() has 9 ways of computing quantiles; you are using the default, which is type 7 (see ?stats::quantile for the different types).

If you try

as.numeric(cut(vector, 
               breaks = quantile(vector, 
                                 probs = seq(0, 1, length.out = 11),
                                 na.rm = TRUE,
                                 type = 2),
               include.lowest = TRUE))
# [1]  6  6  2  4  1  9  2 NA  9  3  8  7  3 10  5 NA  8  5 NA NA  1

you get the same result as with the NA-aware ntile_na().
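
The type argument changes the break points themselves, which is why the groups shift. A small illustration on a made-up vector (not the data from the question):

```r
y <- c(1, 2, 3, 4)  # illustrative data only

# Type 7 (the default) interpolates linearly between order statistics.
q7 <- quantile(y, probs = 0.25, type = 7)

# Type 2 uses the empirical CDF and averages at discontinuities.
q2 <- quantile(y, probs = 0.25, type = 2)

unname(q7)  # 1.75
unname(q2)  # 1.5
```

Any value between 1.5 and 1.75 would land in a different group depending on which type produced the breaks passed to cut().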

In summary: it is not a bug; the two functions are simply implemented differently.

answered Oct 21 '22 15:10

DAVL
