Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

plot/ggplot2 - Fill area with too many points

Tags:

plot

r

ggplot2

Final implementation - not finished but heading the right way

Idea/Problem: You have a plot with many overlapping points and want to replace them by a plain area, therefore increasing performance viewing the plot.

Possible implementation: Calculate a distance matrix between all points and connect all points below a specified distance.

Todo/Not finished: This currently works for manually set distances depending on size of the printed plot. I stopped here because the outcome didnt meet my aesthetic sense.

Minimal example with intermediate plots

set.seed(074079089)
n.points <- 3000

mat <- matrix(rnorm(n.points*2, 0,0.2), nrow=n.points, ncol=2)
colnames(mat) <- c("x", "y")

d.mat <- dist(mat)
fit.mat <-hclust(d.mat, method = "single")
lims <- c(-1,1)
real.lims <- lims*1.1               ## ggplot invokes them approximately

# An attempt to estimate the point-sizes, works for default pdfs pdf("test.pdf")
cutsize <- sum(abs(real.lims))/100  
groups <- cutree(fit.mat, h=cutsize) # cut tree at height cutsize
# plot(fit.mat) # display dendogram

# draw dendogram with red borders around the 5 clusters
# rect.hclust(fit.mat, h=cutsize, border="red")

library(ggplot2)
df <- data.frame(mat)
df$groups <- groups
plot00 <- ggplot(data=df, aes(x,y, col=factor(groups))) + 
    geom_point() + guides(col=FALSE) +  xlim(lims) + ylim(lims)+ 
    ggtitle("Each color is a group")
pdf("plot00.pdf")
print(plot00)
dev.off()

plot00 - points with group color

# If less than 4 points are connected, show them seperately
t.groups <- table(groups)   # how often which group
drop.group <- as.numeric(names(t.groups[t.groups<4]))   # groups with less than 4 points are taken together
groups[groups %in% drop.group] <- 0                     # in group 0
df$groups <- groups
plot01 <- ggplot(data=df, aes(x,y, col=factor(groups))) + 
    geom_point() + xlim(lims)+ ylim(lims) + 
    scale_color_hue(l=10)
pdf("plot01.pdf")
print(plot01)
dev.off()

plot01 - all single points in one group

find_hull <- function(df_0) 
{
    return(df_0[chull(df_0$x, df_0$y), ])
}


library(plyr)
single.points.df <- df[df$groups == 0 , ]
connected.points.df <- df[df$groups != 0 , ]
hulls <- ddply(connected.points.df, "groups", find_hull) #  for all groups find a hull
plot02 <- ggplot() + 
    geom_point(data=single.points.df, aes(x,y, col=factor(groups))) + 
    xlim(lims)+ ylim(lims) + 
    scale_color_hue(l=10)
pdf("plot02.pdf")
print(plot02)
dev.off()

plot02 - only "single"-points (less than 4 connected points)

plot03 <- plot02
for(grp in names(table(hulls$groups)))
{
    plot03 <- plot03 + geom_polygon(data=hulls[hulls$groups==grp, ],
                                    aes(x,y), alpha=0.4)
}
# print(plot03)
plot01 <- plot01 + theme(legend.position="none")
plot03 <- plot03 + theme(legend.position="none")
# multiplot(plot01, plot03, cols=2)
pdf("plot03.pdf")
print(plot03)
dev.off()

plot03 - final

Initial Question

I have a (maybe odd) question.

In some plots, I have thousands of points in my analysis. To display them, the pc takes quite a bit of time because there are so many points. After now, many of these points can overlap, I have a filled area (which is fine!). To save time/effort displaying, it would be usefull to just fill this area but plotting each point on its own.

I know there are possibilities in heatmaps and so on, but this is not the idea I have in mind. My idea is something like:

#plot00: ggplot with many many points and a filled area of points
plot00 <- plot00 + fill.crowded.areas()

# with plot(), I sadly have an idea how to manage it

Any ideas? Or is this nothing anyone would do anytime?

# Example code
# install.packages("ggplot2")
library(ggplot2)

n.points <- 10000
mat <- matrix(rexp(n.points*2), nrow=n.points, ncol=2)
colnames(mat) <- c("x", "y")
df <- data.frame(mat)
plot00 <- ggplot(df, aes(x=x, y=y)) + 
    theme_bw()  +                       # white background, grey strips
    geom_point(shape=19)# Aussehen der Punkte

print(plot00)

ggplot2

# NO ggplot2
plot(df, pch=19)

plot

Edit:
To have density-plots like mentioned by fdetsch (how can I mark the name?) there are some questions concerning this topic. But this is not the thing I want exactly. I know my concern is a bit strange, but the densities make a plot more busy sometimes as necessary.

Links to topics with densities:

Scatterplot with too many points
High Density Scatter Plots

like image 317
groebsgr Avatar asked Jan 27 '16 10:01

groebsgr


2 Answers

How about using panel.smoothScatter from lattice? It displays a certain number of points in low-density regions (see argument 'nrpoints') and everywhere else, point densities are displayed rather than single (and possibly overlapping) points, thus providing more meaningful insights into your data. See also ?panel.smoothScatter for further information.

## load 'lattice'
library(lattice)

## display point densities
xyplot(y ~ x, data = df, panel = function(x, y, ...) {
  panel.smoothScatter(x, y, nbin = 250, ...)
})

enter image description here

like image 119
fdetsch Avatar answered Sep 22 '22 13:09

fdetsch


You could use a robust estimator to estimate the location of the majority of your points and plot the convex hull of the points as follows:

set.seed(1337)
n.points <- 500
mat <- matrix(rexp(n.points*2), nrow=n.points, ncol=2)
colnames(mat) <- c("x", "y")
df <- data.frame(mat)

require(robustbase)
my_poly <- function(data, a, ...){
  cov_rob = covMcd(data, alpha = a)
  df_rob = data[cov_rob$best,]
  ch = chull(df_rob$x, df_rob$y)
  geom_polygon(data = df_rob[ch,], aes(x,y), ...)
}

require(ggplot2)
ggplot() + 
  geom_point(data=df, aes(x,y)) +
  my_poly(df, a = 0.5, fill=2, alpha=0.5) +
  my_poly(df, a = 0.7, fill=3, alpha=0.5) 

This leads to:

enter image description here

by controlling the alpha-value of covMcd you can increase/decrease the size of the area. See ?robustbase::covMcd for details. Btw.: Mcd stands for Minimum Covariance Determinant. Instead of it you can also use MASS::cov.mve to calculate the minimum valume ellipsoid with MASS::cov.mve(..., quantile.used=-percent of points within the ellipsoid.

For 2+ classes:

my_poly2 <- function(data, a){
  cov_rob = covMcd(data, alpha = a)
  df_rob = data[cov_rob$best,]
  ch = chull(df_rob[,1], df_rob[,2])
  df_rob[ch,]
}

ggplot(faithful, aes(waiting, eruptions, color = eruptions > 3)) +
  geom_point() + 
  geom_polygon(data = my_poly2(faithful[faithful$eruptions > 3,], a=0.5), aes(waiting, eruptions), fill = 2, alpha = 0.5) +
  geom_polygon(data = my_poly2(faithful[faithful$eruptions < 3,], a=0.5), aes(waiting, eruptions), fill = 3, alpha = 0.5)

enter image description here

Or if you are ok with un-robust ellipsoids have a look at stat_ellipse

like image 29
Rentrop Avatar answered Sep 22 '22 13:09

Rentrop