My data are pre-processed image data and I want to seperate two classes. In therory (and hopefully in practice) the best threshold is the local minimum between the two peaks in the bimodal distributed data.
My testdata is: http://www.file-upload.net/download-9365389/data.txt.html
I tried to follow this thread: I plotted the histogram and calculated the kernel density function:
datafile <- read.table("....txt")
data <- data$V1
hist(data)
d <- density(data) # returns the density data with defaults
hist(data,prob=TRUE)
lines(d) # plots the results
But how to continue?
I would calculate the first and second derivates of the density function to find the local extrema, specifically the local minimum. However I have no idea how to do this in R and density(test)
seems not to be a normal function. Thus please help me: how can I calculate the derivates and find the local minimum of the pit between the two peaks in the density function density(test)
?
There exist two way of detecting bimodality of data in R. One of them is using is. bimodal() function available in LaplacesDemon package (Statisticat, 2021). The other one is the bimodality coefficient as specified in Pfister et al.
A better way to analyze and interpret bimodal distributions is to simply break the data into two separate groups, then analyze the center and the spread for each group.
There are a few ways to do this.
First, using d
for the density as in your question, d$x
and d$y
contain the x and y values for the density plot. The minimum occurs when the derivative dy/dx = 0. Since the x-values are equally spaced, we can estimate dy using diff(d$y)
, and seek d$x
where abs(diff(d$y))
is minimized:
d$x[which.min(abs(diff(d$y)))]
# [1] 2.415785
The problem is that the density curve could also be maximized when dy/dx = 0. In this case the minimum is shallow but the maxima are peaked, so it works, but you can't count on that.
So a second way uses optimize(...)
which seeks a local minimum in a given interval. optimize(...)
needs a function as argument, so we use approxfun(d$x,d$y)
to create an interpolation function.
optimize(approxfun(d$x,d$y),interval=c(1,4))$minimum
# [1] 2.415791
Finally, we show that this is indeed the minimum:
hist(data,prob=TRUE)
lines(d, col="red", lty=2)
v <- optimize(approxfun(d$x,d$y),interval=c(1,4))$minimum
abline(v=v, col="blue")
Another approach, which is preferred actually, uses k-means clustering.
df <- read.csv(header=F,"data.txt")
colnames(df) = "X"
# bimodal
km <- kmeans(df,centers=2)
df$clust <- as.factor(km$cluster)
library(ggplot2)
ggplot(df, aes(x=X)) +
geom_histogram(aes(fill=clust,y=..count../sum(..count..)),
binwidth=0.5, color="grey50")+
stat_density(geom="line", color="red")
The data actually looks more trimodal than bimodal.
# trimodal
km <- kmeans(df,centers=3)
df$clust <- as.factor(km$cluster)
library(ggplot2)
ggplot(df, aes(x=X)) +
geom_histogram(aes(fill=clust,y=..count../sum(..count..)),
binwidth=0.5, color="grey50")+
stat_density(geom="line", color="red")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With