I have a pretty big dataset (around 5e5
rows) of (x, y)
coordinates with additional feature z
. It's something like this:
x <- rnorm(1e6, 0, 5)
y <- rnorm(1e6, 0, 10)
dist <- sqrt(x^2 + y^2)
z <- exp(-(dist / 8)^2)
I want to plot them with a z
feature used as a color aesthetic. But simple geom_point
takes a while with such a big dataset:
data.frame(x, y, z) %>%
ggplot() + geom_point(aes(x, y, color = z))
So I think I need a way to aggregate points in some way. One approach would be to divide a plane to some small squares and average all the z
values for points that lie in a square. But it can be a little cumbersome in the long term and it's probably better to use some of already available tools. So I thought about geom_hex
as a geom that would look good in my case. But fill
aesthetic is setup to count
as default. So my questions are:
fill
value of geom_hex
be easily changed to an average of z
feature? z
value can be averaged within hexagons and then plotted?Comparison of proposed solutions:
library(microbenchmark)
microbenchmark(
'stat_summary_hex' = {data.frame(x, y, z) %>%
ggplot( aes(x, y, z=z )) + stat_summary_hex(fun = function(x) mean(x))},
'round_and_group' = {data.frame(x, y, z) %>%
mutate(x=round(x, 0), y=round(y, 0)) %>%
group_by(x,y) %>%
summarize(z = mean(z)) %>%
ggplot() + geom_hex(aes(x, y, fill = z), stat="identity")}
)
Unit: milliseconds
expr min lq mean median uq max neval
stat_summary_hex 2.243791 2.38539 2.454039 2.426123 2.50871 2.963176 100
round_and_group 183.785828 186.38851 188.296828 187.347476 189.10874 218.668487 100
stat="identity"
is used on bar/column charts to use value instead of count. This appears to work with geom_hex
library(dplyr)
library(ggplot2)
x <- rnorm(1e4, 0, 5)
y <- rnorm(1e4, 0, 10)
dist <- sqrt(x^2 + y^2)
z <- exp(-(dist / 8)^2)
## Summarize to rounded x and y, calculate mean(z), use stat = "identity"
data.frame(x, y, z) %>%
mutate(x=round(x, 0), y=round(y, 0)) %>%
group_by(x,y) %>%
summarize(z = mean(z)) %>%
ggplot() + geom_hex(aes(x, y, fill = z), stat="identity")
You might consider using rasters for this:
library(raster)
library(rasterVis)
p = data.frame(x, y, z)
coordinates(p) = ~x+y
r = raster(nrows=500, ncols=500, ext = extent(c(range(c(x,y)), range(c(x,y)))), crs=CRS("+init=epsg:28992"))
r = rasterize(p, r, 'z', fun=mean)
levelplot(r)
NB If you don't want to use RasterVis you can plot with ggplot or base graphics if you prefer. E.g. with ggplot, we can do
ggplot(as.data.frame(r, xy = TRUE) ) +
geom_raster(aes(x, y, fill = layer)) +
scale_fill_continuous(na.value="white")
Maybe it could help stat_summary_hex()
, or stat_summary_2d()
.
They are similar to stat_summary()
, the data are divided in bins with x
and y
, then summarised by z
, using the function specified in stat_summary_hex()
(or stat_summary_2d()
).
library(tidyverse)
data.frame(x, y, z) %>%
# here you can specify the function that welcomes the z parameter
ggplot( aes(x, y, z=z )) + stat_summary_hex(fun = function(x) mean(x))
It is going to answer the your second question (hex), and your third question (seems ok with perfomance as you stated), in place of using geom_hex()
(so it seems there is a trade of between geom_hex()
and velocity).
EDIT
Looking at your questions, I've microbenchmarked the function with different values:
Unit: milliseconds
expr min lq mean median uq max neval
3.5e5 205.0363 214.6925 236.8149 225.2286 238.6536 494.7897 100
1e6 575.4861 597.4161 665.4396 620.9151 702.1622 1143.7011 100
Also, you can also specify the bins, to have more or less "precise" hexes. The default value should be 30, that means it's going to plot the points in an area of 30 * 30 hexes:
data.frame(x, y, z) %>%
ggplot( aes(x, y, z=z )) + stat_summary_hex(fun = function(x) mean(x), bins = 60)
As example (here the multiplot()
function, if necessary):
set.seed(1)
x <- rnorm(1e4, 0, 5)
y <- rnorm(1e4, 0, 10)
dist <- sqrt(x^2 + y^2)
z <- exp(-(dist / 8)^2)
library(tidyverse)
a1 <- data.frame(x, y, z) %>%
ggplot() + geom_point(aes(x, y, color = z))
b1 <- data.frame(x, y, z) %>%
ggplot( aes(x, y, z=z )) + stat_summary_hex(fun = function(x) mean(x))
c1 <- data.frame(x, y, z) %>%
ggplot( aes(x, y, z=z )) + stat_summary_hex(fun = function(x) mean(x), bins = 60)
multiplot(a1,b1,c1, cols = 3)
As you can see, the more you add hexes, the most you are closer to your original points.
With data:
x <- rnorm(1e4, 0, 5)
y <- rnorm(1e4, 0, 10)
dist <- sqrt(x^2 + y^2)
z <- exp(-(dist / 8)^2)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With