I'm having trouble finding a function in R that performs equal-frequency discretization. I stumbled on the 'infotheo' package, but after some testing I found that the algorithm is broken. 'dprep' seems to no longer be supported on CRAN.
EDIT :
For clarity, I do not need to seperate the values between the bins. I really want equal frequency, it doesn't matter if one value ends up in two bins. Eg :
c(1,3,2,1,2,2)
should give a bin c(1,1,2)
and one c(2,2,3)
EDIT : given your real goal, why don't you just do (corrected) :
EqualFreq2 <- function(x,n){
nx <- length(x)
nrepl <- floor(nx/n)
nplus <- sample(1:n,nx - nrepl*n)
nrep <- rep(nrepl,n)
nrep[nplus] <- nrepl+1
x[order(x)] <- rep(seq.int(n),nrep)
x
}
This returns a vector with indicators for which bin they are. But as some values might be present in both bins, you can't possibly define the bin limits. But you can do :
x <- rpois(50,5)
y <- EqualFreq2(x,15)
table(y)
split(x,y)
Original answer:
You can easily just use cut()
for this :
EqualFreq <-function(x,n,include.lowest=TRUE,...){
nx <- length(x)
id <- round(c(1,(1:(n-1))*(nx/n),nx))
breaks <- sort(x)[id]
if( sum(duplicated(breaks))>0 stop("n is too large.")
cut(x,breaks,include.lowest=include.lowest,...)
}
Which gives :
set.seed(12345)
x <- rnorm(50)
table(EqualFreq(x,5))
[-2.38,-0.886] (-0.886,-0.116] (-0.116,0.586] (0.586,0.937] (0.937,2.2]
10 10 10 10 10
x <- rpois(50,5)
table(EqualFreq(x,5))
[1,3] (3,5] (5,6] (6,7] (7,11]
10 13 11 6 10
As you see, for discrete data an optimal equal binning is rather impossible in most cases, but this method gives you the best possible binning available.
This sort of thing is also quite easily solved by using (abusing?) the conditioning plot infrastructure from lattice, in particular function co.intervals()
:
cutEqual <- function(x, n, include.lowest = TRUE, ...) {
stopifnot(require(lattice))
cut(x, co.intervals(x, n, 0)[c(1, (n+1):(n*2))],
include.lowest = include.lowest, ...)
}
Which reproduces @Joris' excellent answer:
> set.seed(12345)
> x <- rnorm(50)
> table(cutEqual(x, 5))
[-2.38,-0.885] (-0.885,-0.115] (-0.115,0.587] (0.587,0.938] (0.938,2.2]
10 10 10 10 10
> y <- rpois(50, 5)
> table(cutEqual(y, 5))
[0.5,3.5] (3.5,5.5] (5.5,6.5] (6.5,7.5] (7.5,11.5]
10 13 11 6 10
In the latter, discrete, case the breaks are different although they have the same effect; the same observations are in the same bins.
How about?
a <- rnorm(50)
> table(Hmisc::cut2(a, m = 10))
[-2.2020,-0.7710) [-0.7710,-0.2352) [-0.2352, 0.0997) [ 0.0997, 0.9775)
10 10 10 10
[ 0.9775, 2.5677]
10
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With