Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Cut() error - 'breaks' are not unique

Tags:

r

I have following dataframe:

 a         
    ID   a.1    b.1     a.2   b.2
1    1  40.00   100.00  NA    88.89
2    2  100.00  100.00  100   100.00
3    3  50.00   100.00  75    100.00
4    4  66.67   59.38   NA    59.38
5    5  37.50   100.00  NA    100.00
6    6  100.00  100.00  100   100.00

When I apply the following code to this dataframe:

 temp <- do.call(rbind,strsplit(names(df)[-1],".",fixed=TRUE))
 dup.temp <- temp[duplicated(temp[,1]),]

 res <- lapply(dup.temp[,1],function(i) {
 breaks <- c(-Inf,quantile(a[,paste(i,1,sep=".")], na.rm=T),Inf)
 cut(a[,paste(i,2,sep=".")],breaks)
 })

the cut () function gives an error:

 Error in cut.default(a[, paste(i, 2, sep = ".")], breaks) : 
 'breaks' are not unique

However, the same code works perfectly well on similar dataframe:

 varnames<-c("ID", "a.1", "b.1", "c.1", "a.2", "b.2", "c.2")

 a <-matrix (c(1,2,3,4, 5, 6, 7), 2,7)

 colnames (a)<-varnames

 df<-as.data.frame (a)


    ID  a.1  b.1  c.1  a.2  b.2  c.2
  1  1    3    5    7    2    4    6
  2  2    4    6    1    3    5    7

 res <- lapply(dup.temp[,1],function(i) {
 breaks <- c(-Inf,quantile(a[,paste(i,1,sep=".")], na.rm=T),Inf)
 cut(a[,paste(i,2,sep=".")],breaks)
 })

 res
[[1]]
[1] (-Inf,3] (-Inf,3]
Levels: (-Inf,3] (3,3.25] (3.25,3.5] (3.5,3.75] (3.75,4] (4, Inf]

[[2]]
[1] (-Inf,5] (-Inf,5]
Levels: (-Inf,5] (5,5.25] (5.25,5.5] (5.5,5.75] (5.75,6] (6, Inf]

[[3]]
[1] (5.5,7] (5.5,7]
Levels: (-Inf,1] (1,2.5] (2.5,4] (4,5.5] (5.5,7] (7, Inf]

What it the reason for this error? How can it be fixed? Thank you.

like image 891
DSSS Avatar asked Apr 24 '13 06:04

DSSS


4 Answers

You get this error because quantile values in your data for columns b.1, a.2 and b.2 are the same for some levels, so they can't be directly used as breaks values in function cut().

apply(a,2,quantile,na.rm=T)
       ID      a.1    b.1   a.2      b.2
0%   1.00  37.5000  59.38  75.0  59.3800
25%  2.25  42.5000 100.00  87.5  91.6675
50%  3.50  58.3350 100.00 100.0 100.0000
75%  4.75  91.6675 100.00 100.0 100.0000
100% 6.00 100.0000 100.00 100.0 100.0000

One way to solve this problem would be to put quantile() inside unique() function - so you will remove all quantile values that are not unique. This of course will make less breaking points if quantiles are not unique.

res <- lapply(dup.temp[,1],function(i) {
  breaks <- c(-Inf,unique(quantile(a[,paste(i,1,sep=".")], na.rm=T)),Inf)
  cut(a[,paste(i,2,sep=".")],breaks)
})

[[1]]
[1] <NA>        (91.7,100]  (58.3,91.7] <NA>        <NA>        (91.7,100] 
Levels: (-Inf,37.5] (37.5,42.5] (42.5,58.3] (58.3,91.7] (91.7,100] (100, Inf]

[[2]]
[1] (59.4,100]  (59.4,100]  (59.4,100]  (-Inf,59.4] (59.4,100]  (59.4,100] 
Levels: (-Inf,59.4] (59.4,100] (100, Inf]
like image 171
Didzis Elferts Avatar answered Oct 21 '22 19:10

Didzis Elferts


If you'd rather keep the number of quantiles, another option is to just add a little bit of jitter, e.g.

breaks = c(-Inf,quantile(a[,paste(i,1,sep=".")], na.rm=T),Inf)
breaks = breaks + seq_along(breaks) * .Machine$double.eps
like image 30
eddi Avatar answered Oct 21 '22 20:10

eddi


This comes from the fact that your breaks are not unique. Instead of cut, you should use .bincode, which accepts a non unique vector of breaks.

x <- c(0, 0.01, 0.5, 0.99, 1)
breaks <- c(0, 0, 1, 1)
.bincode(x, breaks)
like image 14
Matthew Avatar answered Oct 21 '22 21:10

Matthew


If you actually mean the 10% or 25% portions of your population when you say decile, quartile etc. and not the actual numeric values of the decile/quartile buckets, you can rank your values first, and apply the quantile function on the ranks:

a <- c(1,1,1,2,3,4,5,6,7,7,7,7,99,0.5,100,54,3,100,100,100,11,11,12,11,0)
a_ranks <- rank(a, ties.method = "first")
decile <- cut(a_ranks, quantile(a_ranks, probs=0:10/10), include.lowest=TRUE, labels=FALSE)  
like image 7
Dimitar Nentchev Avatar answered Oct 21 '22 20:10

Dimitar Nentchev