Find local minimum in bimodal distribution with r

My data are pre-processed image data and I want to separate two classes. In theory (and hopefully in practice) the best threshold is the local minimum between the two peaks of the bimodally distributed data.

My test data: http://www.file-upload.net/download-9365389/data.txt.html

I tried to follow this thread: I plotted the histogram and calculated the kernel density function:

datafile <- read.table("....txt")
data <- datafile$V1 # the values are in the first column of the table read above
hist(data)

d <- density(data) # returns the density data with defaults
hist(data,prob=TRUE)
lines(d) # plots the results

But how to continue?

I would calculate the first and second derivatives of the density function to find the local extrema, specifically the local minimum. However, I have no idea how to do this in R, and density(data) does not seem to return an ordinary function. So please help me: how can I calculate the derivatives and find the local minimum in the pit between the two peaks of the density function density(data)?

asked Aug 12 '14 12:08

Iris

1 Answer

There are a few ways to do this.

First, using d for the density as in your question, d$x and d$y contain the x and y values for the density plot. The minimum occurs when the derivative dy/dx = 0. Since the x-values are equally spaced, we can estimate dy using diff(d$y), and seek d$x where abs(diff(d$y)) is minimized:

d$x[which.min(abs(diff(d$y)))]
# [1] 2.415785

The problem is that dy/dx = 0 at the maxima of the density curve as well, and the slope is also close to zero in the flat tails. In this case the minimum is shallow but the maxima are peaked, so it works, but you can't count on that.
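One way to make the finite-difference idea less fragile is to look at where the slope changes sign rather than where it is smallest. The following is only a sketch under stated assumptions: the sample x is synthetic (it stands in for your data, which I am not reproducing here), and the names s, min_idx, max_idx, peaks and valley are just illustrative.

```r
# Synthetic bimodal sample (an assumption; it stands in for the real data)
set.seed(1)
x <- c(rnorm(500, mean = 0), rnorm(500, mean = 5))
d <- density(x)

# Sign of the slope between neighbouring grid points
s <- sign(diff(d$y))

# Local minimum: slope turns from negative to positive
min_idx <- which(diff(s) > 0) + 1
# Local maximum: slope turns from positive to negative
max_idx <- which(diff(s) < 0) + 1

# Grid indices of the two highest peaks
peaks <- max_idx[order(d$y[max_idx], decreasing = TRUE)][1:2]

# Keep only minima between those peaks and take the deepest one
valley <- min_idx[min_idx > min(peaks) & min_idx < max(peaks)]
threshold <- d$x[valley[which.min(d$y[valley])]]
threshold
```

Because only sign changes from negative to positive are kept, maxima and the flat tails are never mistaken for the valley. The result still depends on the bandwidth that density(...) chooses, so a noisy estimate may show several small dips.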

So a second way uses optimize(...) which seeks a local minimum in a given interval. optimize(...) needs a function as argument, so we use approxfun(d$x,d$y) to create an interpolation function.

optimize(approxfun(d$x,d$y),interval=c(1,4))$minimum
# [1] 2.415791

Finally, we show that this is indeed the minimum:

hist(data,prob=TRUE)
lines(d, col="red", lty=2)
v <- optimize(approxfun(d$x,d$y),interval=c(1,4))$minimum
abline(v=v, col="blue")
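If you need this for more than one image, the optimize(...) steps can be collected into a small helper. This is a minimal sketch, not a definitive implementation: density_valley is a hypothetical name, the sample x is synthetic, and the search interval still has to be chosen so that it lies between the two peaks.

```r
# Hypothetical helper: location of the density minimum inside `interval`
density_valley <- function(x, interval, ...) {
  d <- density(x, ...)          # kernel density estimate; ... is passed to density()
  f <- approxfun(d$x, d$y)      # linear interpolation of the estimate
  optimize(f, interval = interval)$minimum
}

# Synthetic bimodal sample (an assumption; it stands in for the real data)
set.seed(42)
x <- c(rnorm(400, mean = 1, sd = 0.5), rnorm(600, mean = 4, sd = 0.7))

threshold <- density_valley(x, interval = c(1, 4))
threshold

# Split the sample into the two classes at the threshold
table(x > threshold)
```

Note that approxfun(...) returns NA outside the range of d$x, so the interval should stay inside the range covered by the density estimate.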

Another approach, which I would actually prefer, uses k-means clustering.

df <- read.csv("data.txt", header=FALSE)
colnames(df) <- "X"

# bimodal
km <- kmeans(df$X,centers=2) # cluster on X only, so the block can be re-run after clust is added
df$clust <- as.factor(km$cluster)
library(ggplot2)
ggplot(df, aes(x=X)) + 
  geom_histogram(aes(fill=clust,y=after_stat(count/sum(count))),
                 binwidth=0.5, color="grey50")+
  stat_density(geom="line", color="red")
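The clustering above colours the histogram but does not print a threshold. For one-dimensional data with two clusters, each point ends up with its nearest center, so the class boundary is the midpoint between the two centers. Here is a sketch on synthetic data (an assumption; x, centers and low_cluster are illustrative names):

```r
# Synthetic bimodal sample (an assumption; it stands in for the real data)
set.seed(7)
x <- c(rnorm(500, mean = 0), rnorm(500, mean = 5))

# nstart = 10 repeats the random initialisation and keeps the best fit
km <- kmeans(x, centers = 2, nstart = 10)
centers <- sort(as.numeric(km$centers))

# Boundary between the two clusters: midpoint of the two centers
threshold <- mean(centers)
threshold

# Check: points below the threshold are exactly the cluster with the lower center
low_cluster <- which.min(km$centers)
all((x < threshold) == (km$cluster == low_cluster))
```

This threshold need not coincide with the density minimum: when the two classes differ a lot in size or spread, the midpoint between the centers can sit noticeably to one side of the valley.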

The data actually looks more trimodal than bimodal.

# trimodal
km <- kmeans(df$X,centers=3) # df now also holds the factor clust, so cluster on X only
df$clust <- as.factor(km$cluster)
library(ggplot2)
ggplot(df, aes(x=X)) + 
  geom_histogram(aes(fill=clust,y=after_stat(count/sum(count))),
                 binwidth=0.5, color="grey50")+
  stat_density(geom="line", color="red")

answered Oct 16 '22 23:10

jlhoward

Tags: r, kernel, statistics, distribution, kernel-density
