Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Calculating peaks in histograms or density functions

Tags:

r

histogram

There seem to be a lot of "peaks in density function" threads already, but I don't see one addressing this point specifically. Sorry to duplicate if I missed it.

My problem: Given a vector of 1000 values (sample attached), I would like to identify the peaks in the histogram or density function of the data. From the image of the sample data below , I can see peaks in the histogram at ~0, 6200, and 8400. But I need the obtain the exact values of these peaks, preferably in a simple procedure as I have several thousand of these vectors to process.

Hist and Density Function

I originally started working with the histogram outputs themselves, but couldn't get any peak-finding command to work properly (like, not at all). I'm not even sure how it would get the peaks() command from the splus2R package to work on histogram object or on a density object. This would still be my preference, as I would like to identify the exact data value of the max frequency of each peak (as opposed to the density function value, which is slightly different), but I can't figure that one out either.

I would post the sample data themselves, but I can't see a way to do that on here (sorry if I'm just missing it).

like image 425
David Roberts Avatar asked Oct 30 '12 05:10

David Roberts


4 Answers

Finding Peaks in density functions is, as already given in the comments, related to Finding local maxima and minima where you can find more solutions. The answer of chthonicdaemon is close to the peak, but each diff is reducing the vector length by one.

#Create Dataset
x <- c(1,1,4,4,9)

#Estimate Density
d <- density(x)

#Two ways to get highest Peak
d$x[d$y==max(d$y)]  #Gives you all highest Peaks
d$x[which.max(d$y)] #Gives you the first highest Peak

#3 ways to get all Peaks
d$x[c(F, diff(diff(d$y)>=0)<0)] #This detects also a plateau
d$x[c(F, diff(sign(diff(d$y)))<0)]
d$x[which(diff(sign(diff(d$y)))<0)+1]

#In case you also want the height of the peaks
data.frame(d[c("x", "y")])[c(F, diff(diff(d$y)>=0)<0),]

#In case you need a higher "precision"
d <- density(x, n=1e4)
like image 100
GKi Avatar answered Nov 13 '22 03:11

GKi


After a good 8+ years later this is still a valid and classic question. Here's a complete answer now with the excellent clue given by @chthonicdaemon.

library(ggplot)
library(data.table)
### I use a preloaded data.table. You can use any data.table with one numeric column x.
### Extract counts & breaks of the histogram bins. 
### I have taken breaks as 40 but you can take any number as needed.
### But do keep a large number of breaks so that you get multiple peaks.
counts <- hist(dt1$x,breaks = 40)$counts
breaks <- hist(dt1$x, breaks = 40)$breaks
### Note: the data.table `dt1` should contain at least one numeric column, x

### now name the counts vector with the corresponding breaks 
### note: the length of counts is 1 less than the breaks
names(counts) <- breaks[-length(breaks)]

### Find index for those counts that are the peaks 
### (see previous classic clue to take a double diff)
### note: the double diff causes the 2 count shrink, hence
#### I have added a FALSE before and after the results 
### to align the T/F vector with the count vector

peak_indx <- c(F,diff(sign(c(diff(counts))))==-2,F) %>% which()
topcounts <- counts[peak_indx]
topbreaks <- names(topcounts) %>% as.numeric()

### Now let's use ggplot to plot the histogram along with visualised peaks

dt1 %>%     
ggplot() + 
geom_histogram(aes(x),bins = 40,col="grey51",na.rm = T) + 
geom_vline(xintercept = topbreaks + 50,lty = 2) 
# adjust the value 50 to bring the lines in the centre

Output histogram with peaks marked

like image 34
Lazarus Thurston Avatar answered Nov 13 '22 03:11

Lazarus Thurston


If your y values are smooth (like in your sample plot), this should find the peaks pretty repeatably:

peakx <- x[which(diff(sign(diff(y)))==-2)]
like image 27
chthonicdaemon Avatar answered Nov 13 '22 04:11

chthonicdaemon


Since you are thinking about histograms, maybe you should use the histogram output directly?

data <- c(rnorm(100,mean=20),rnorm(100,mean=12))

peakfinder <- function(d){
  dh <- hist(d,plot=FALSE)
  ins <- dh[["intensities"]]
  nbins <- length(ins)
  ss <- which(rank(ins)%in%seq(from=nbins-2,to=nbins)) ## pick the top 3 intensities
  dh[["mids"]][ss]
}

peaks <- peakfinder(data)

hist(data)
sapply(peaks,function(x) abline(v=x,col="red"))

This isn't perfect -- for example, it will find just the top bins, even if they are adjacent. Maybe you could define 'peak' more precisely? Hope that helps.

enter image description here

like image 3
AndrewMacDonald Avatar answered Nov 13 '22 03:11

AndrewMacDonald