I have a pretty big dataset (around 5e5 rows) of (x, y) coordinates with additional feature z. It's something like this:
x <- rnorm(1e6, 0, 5)
y <- rnorm(1e6, 0, 10)
dist <- sqrt(x^2 + y^2)
z <- exp(-(dist / 8)^2)
I want to plot them with a z feature used as a color aesthetic. But simple geom_point takes a while with such a big dataset:
data.frame(x, y, z) %>% 
  ggplot() + geom_point(aes(x, y, color = z)) 

So I think I need a way to aggregate points in some way. One approach would be to divide a plane to some small squares and average all the z values for points that lie in a square. But it can be a little cumbersome in the long term and it's probably better to use some of already available tools. So I thought about geom_hex as a geom that would look good in my case. But fill aesthetic is setup to count as default. So my questions are:
fill value of geom_hex be easily changed to an average of z feature? z value can be averaged within hexagons and then plotted?Comparison of proposed solutions:
library(microbenchmark)
microbenchmark(
  'stat_summary_hex' = {data.frame(x, y, z) %>%                                                                                                   
    ggplot( aes(x, y, z=z )) + stat_summary_hex(fun = function(x) mean(x))},
  'round_and_group' = {data.frame(x, y, z) %>%                                                   
      mutate(x=round(x, 0), y=round(y, 0)) %>%                                  
      group_by(x,y) %>%                                                         
      summarize(z = mean(z)) %>%                                                
      ggplot() + geom_hex(aes(x, y, fill = z), stat="identity")}
)
Unit: milliseconds
             expr        min        lq       mean     median        uq        max neval
 stat_summary_hex   2.243791   2.38539   2.454039   2.426123   2.50871   2.963176   100
  round_and_group 183.785828 186.38851 188.296828 187.347476 189.10874 218.668487   100
stat="identity" is used on bar/column charts to use value instead of count. This appears to work with geom_hex
library(dplyr)                                                            
library(ggplot2)                                                          
x <- rnorm(1e4, 0, 5)                                                     
y <- rnorm(1e4, 0, 10)                                                    
dist <- sqrt(x^2 + y^2)                                                   
z <- exp(-(dist / 8)^2)                                                   
##  Summarize to rounded x and y, calculate mean(z), use stat = "identity"
data.frame(x, y, z) %>%                                                   
mutate(x=round(x, 0), y=round(y, 0)) %>%                                  
group_by(x,y) %>%                                                         
summarize(z = mean(z)) %>%                                                
ggplot() + geom_hex(aes(x, y, fill = z), stat="identity")                 

You might consider using rasters for this:
library(raster)
library(rasterVis)
p = data.frame(x, y, z)
coordinates(p) = ~x+y
r = raster(nrows=500, ncols=500, ext = extent(c(range(c(x,y)), range(c(x,y)))), crs=CRS("+init=epsg:28992"))
r = rasterize(p, r, 'z', fun=mean)
levelplot(r)

NB If you don't want to use RasterVis you can plot with ggplot or base graphics if you prefer. E.g. with ggplot, we can do
ggplot(as.data.frame(r, xy = TRUE) ) +
  geom_raster(aes(x, y, fill = layer)) +
  scale_fill_continuous(na.value="white")

Maybe it could help stat_summary_hex(), or stat_summary_2d().
They are similar to stat_summary(), the data are divided in bins with x and y, then summarised by z, using the function specified in stat_summary_hex() (or stat_summary_2d()).
library(tidyverse)
data.frame(x, y, z) %>%  
# here you can specify the function that welcomes the z parameter                                                                                              
ggplot( aes(x, y, z=z )) + stat_summary_hex(fun = function(x) mean(x))

It is going to answer the your second question (hex), and your third question (seems ok with perfomance as you stated), in place of using  geom_hex() (so it seems there is a trade of between geom_hex() and velocity).
EDIT
Looking at your questions, I've microbenchmarked the function with different values:
Unit: milliseconds
  expr      min       lq     mean   median       uq       max neval
 3.5e5 205.0363 214.6925 236.8149 225.2286 238.6536  494.7897   100
   1e6 575.4861 597.4161 665.4396 620.9151 702.1622 1143.7011   100
Also, you can also specify the bins, to have more or less "precise" hexes. The default value should be 30, that means it's going to plot the points in an area of 30 * 30 hexes:
data.frame(x, y, z) %>%                                                                                            
ggplot( aes(x, y, z=z )) + stat_summary_hex(fun = function(x) mean(x), bins = 60)
As example (here the multiplot() function, if necessary):
set.seed(1)
x <- rnorm(1e4, 0, 5)                                                     
y <- rnorm(1e4, 0, 10)                                                    
dist <- sqrt(x^2 + y^2)                                                   
z <- exp(-(dist / 8)^2) 
library(tidyverse)
a1 <- data.frame(x, y, z) %>% 
      ggplot() + geom_point(aes(x, y, color = z)) 
b1 <- data.frame(x, y, z) %>%  
     ggplot( aes(x, y, z=z )) + stat_summary_hex(fun = function(x) mean(x))
c1 <- data.frame(x, y, z) %>%  
      ggplot( aes(x, y, z=z )) + stat_summary_hex(fun = function(x) mean(x), bins = 60)
multiplot(a1,b1,c1, cols = 3)

As you can see, the more you add hexes, the most you are closer to your original points.
With data:
x <- rnorm(1e4, 0, 5)                                                     
y <- rnorm(1e4, 0, 10)                                                    
dist <- sqrt(x^2 + y^2)                                                   
z <- exp(-(dist / 8)^2) 
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With