Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Visualising big set of points with third feature as a color - a way to improve a speed

I have a pretty big dataset (around 5e5 rows) of (x, y) coordinates with additional feature z. It's something like this:

x <- rnorm(1e6, 0, 5)
y <- rnorm(1e6, 0, 10)
dist <- sqrt(x^2 + y^2)
z <- exp(-(dist / 8)^2)

I want to plot them with a z feature used as a color aesthetic. But simple geom_point takes a while with such a big dataset:

data.frame(x, y, z) %>% 
  ggplot() + geom_point(aes(x, y, color = z)) 

enter image description here

So I think I need a way to aggregate points in some way. One approach would be to divide a plane to some small squares and average all the z values for points that lie in a square. But it can be a little cumbersome in the long term and it's probably better to use some of already available tools. So I thought about geom_hex as a geom that would look good in my case. But fill aesthetic is setup to count as default. So my questions are:

  • Can default fill value of geom_hex be easily changed to an average of z feature?
  • If not, how can I create hexagons instead of squares, so that z value can be averaged within hexagons and then plotted?
  • Is there any other way to improve a speed of plotting such a dataset?

Edit:

Comparison of proposed solutions:

library(microbenchmark)
microbenchmark(
  'stat_summary_hex' = {data.frame(x, y, z) %>%                                                                                                   
    ggplot( aes(x, y, z=z )) + stat_summary_hex(fun = function(x) mean(x))},
  'round_and_group' = {data.frame(x, y, z) %>%                                                   
      mutate(x=round(x, 0), y=round(y, 0)) %>%                                  
      group_by(x,y) %>%                                                         
      summarize(z = mean(z)) %>%                                                
      ggplot() + geom_hex(aes(x, y, fill = z), stat="identity")}
)

Unit: milliseconds
             expr        min        lq       mean     median        uq        max neval
 stat_summary_hex   2.243791   2.38539   2.454039   2.426123   2.50871   2.963176   100
  round_and_group 183.785828 186.38851 188.296828 187.347476 189.10874 218.668487   100
like image 938
Kuba_ Avatar asked Oct 21 '18 07:10

Kuba_


3 Answers

stat="identity" is used on bar/column charts to use value instead of count. This appears to work with geom_hex


library(dplyr)                                                            
library(ggplot2)                                                          
x <- rnorm(1e4, 0, 5)                                                     
y <- rnorm(1e4, 0, 10)                                                    
dist <- sqrt(x^2 + y^2)                                                   
z <- exp(-(dist / 8)^2)                                                   

##  Summarize to rounded x and y, calculate mean(z), use stat = "identity"
data.frame(x, y, z) %>%                                                   
mutate(x=round(x, 0), y=round(y, 0)) %>%                                  
group_by(x,y) %>%                                                         
summarize(z = mean(z)) %>%                                                
ggplot() + geom_hex(aes(x, y, fill = z), stat="identity")                 

like image 100
Andrew Lavers Avatar answered Nov 07 '22 07:11

Andrew Lavers


You might consider using rasters for this:

library(raster)
library(rasterVis)

p = data.frame(x, y, z)
coordinates(p) = ~x+y

r = raster(nrows=500, ncols=500, ext = extent(c(range(c(x,y)), range(c(x,y)))), crs=CRS("+init=epsg:28992"))
r = rasterize(p, r, 'z', fun=mean)
levelplot(r)

enter image description here

NB If you don't want to use RasterVis you can plot with ggplot or base graphics if you prefer. E.g. with ggplot, we can do

ggplot(as.data.frame(r, xy = TRUE) ) +
  geom_raster(aes(x, y, fill = layer)) +
  scale_fill_continuous(na.value="white")

enter image description here

like image 3
dww Avatar answered Nov 07 '22 06:11

dww


Maybe it could help stat_summary_hex(), or stat_summary_2d().

They are similar to stat_summary(), the data are divided in bins with x and y, then summarised by z, using the function specified in stat_summary_hex() (or stat_summary_2d()).

library(tidyverse)
data.frame(x, y, z) %>%  
# here you can specify the function that welcomes the z parameter                                                                                              
ggplot( aes(x, y, z=z )) + stat_summary_hex(fun = function(x) mean(x))

enter image description here

It is going to answer the your second question (hex), and your third question (seems ok with perfomance as you stated), in place of using geom_hex() (so it seems there is a trade of between geom_hex() and velocity).

EDIT

Looking at your questions, I've microbenchmarked the function with different values:

Unit: milliseconds
  expr      min       lq     mean   median       uq       max neval
 3.5e5 205.0363 214.6925 236.8149 225.2286 238.6536  494.7897   100
   1e6 575.4861 597.4161 665.4396 620.9151 702.1622 1143.7011   100

Also, you can also specify the bins, to have more or less "precise" hexes. The default value should be 30, that means it's going to plot the points in an area of 30 * 30 hexes:

data.frame(x, y, z) %>%                                                                                            
ggplot( aes(x, y, z=z )) + stat_summary_hex(fun = function(x) mean(x), bins = 60)

As example (here the multiplot() function, if necessary):

set.seed(1)
x <- rnorm(1e4, 0, 5)                                                     
y <- rnorm(1e4, 0, 10)                                                    
dist <- sqrt(x^2 + y^2)                                                   
z <- exp(-(dist / 8)^2) 

library(tidyverse)

a1 <- data.frame(x, y, z) %>% 
      ggplot() + geom_point(aes(x, y, color = z)) 

b1 <- data.frame(x, y, z) %>%  
     ggplot( aes(x, y, z=z )) + stat_summary_hex(fun = function(x) mean(x))

c1 <- data.frame(x, y, z) %>%  
      ggplot( aes(x, y, z=z )) + stat_summary_hex(fun = function(x) mean(x), bins = 60)

multiplot(a1,b1,c1, cols = 3)

enter image description here

As you can see, the more you add hexes, the most you are closer to your original points.


With data:

x <- rnorm(1e4, 0, 5)                                                     
y <- rnorm(1e4, 0, 10)                                                    
dist <- sqrt(x^2 + y^2)                                                   
z <- exp(-(dist / 8)^2) 
like image 2
s__ Avatar answered Nov 07 '22 05:11

s__