I have a numeric matrix with about 10M values and just need to show the distribution of its values in a histogram. In base R, hist() does this quite fast, but ggplot is much slower (I also have to melt the matrix first, but that is not the time-limiting step). Is there any way to make it fast with ggplot?
require(microbenchmark)
require(ggplot2)
mtx1 <- matrix(rnorm(6e4*150), nrow = 6e4)
df1 <- reshape2::melt(mtx1)
g_hist <- function(df){
  print(ggplot(df, aes(x=value)) + geom_histogram(bins=30))
}
print(microbenchmark(
  hist(mtx1),
  g_hist(df1),
  times=3L
), signif=3)
# Unit: milliseconds
#        expr  min   lq mean median   uq  max neval
#  hist(mtx1)  384  471  530    559  603  647     3
# g_hist(df1) 7710 8000 8190   8300 8440 8570     3
Basic histogram with geom_histogram
It is relatively straightforward to build a histogram with ggplot2 thanks to the geom_histogram() function. Only one numeric variable is needed in the input.
You can modify the number of bins using the bins argument. For example, bins = 7 creates a histogram with 7 bins.
For the histogram, the default computation is stat_bin, which uses 30 bins and computes the following variables:
- count, the number of observations in each bin;
- density, the density of observations in each bin (percentage of total / bar width);
- x, the centre of the bin.
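The computed variables listed above can be inspected directly. The following is a minimal sketch, assuming a small toy data frame (`toy`, with an arbitrary seed) that is not part of the original question; it uses `layer_data()` from ggplot2 to pull out the binned data for the first layer.

```r
library(ggplot2)

set.seed(1)                              # arbitrary seed, only for reproducibility
toy <- data.frame(value = rnorm(1000))   # hypothetical toy data, not from the question

p <- ggplot(toy, aes(x = value)) + geom_histogram(bins = 7)

# layer_data() runs the stat and returns the data for one layer (default: the first)
binned <- layer_data(p)
print(binned[, c("x", "xmin", "xmax", "count", "density")])

# Every observation lands in exactly one bin
total_count <- sum(binned$count)

# density is scaled so that the bar areas (density * bin width) sum to 1
total_area <- sum(binned$density * (binned$xmax - binned$xmin))
```

Here `total_count` should equal the number of rows in `toy`, and `total_area` should be 1 up to floating-point error.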
Here is a solution where the histogram bins and bin counts are calculated using the base R hist() function. (Computing the bins does indeed appear to be the source of the bottleneck in geom_histogram().)
Then I use the computed bin counts and bin boundaries along with geom_rect() to draw a histogram that looks pretty much identical to those produced by geom_histogram().
The required time is still greater than base hist(), but by about 1.5-fold instead of 20-fold.
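To make the bin-boundary bookkeeping concrete, here is a tiny sketch in base R only. The vector `vals` and the explicit break points are made-up illustration values; the point is that hist(plot=FALSE) returns one more break than it has counts, so dropping the last break gives the left edges and dropping the first gives the right edges.

```r
# Hypothetical toy values and explicit break points, for illustration only
vals <- c(0.5, 1.5, 1.7, 2.2, 3.9)
res  <- hist(vals, plot = FALSE, breaks = c(0, 1, 2, 3, 4))

res$breaks                      # bin boundaries: one more than the number of bins
res$counts                      # observations per bin

left  <- head(res$breaks, -1L)  # all breaks except the last  = left edges
right <- tail(res$breaks, -1L)  # all breaks except the first = right edges

# One row per bin, ready to be drawn as rectangles
bins <- data.frame(xmin = left, xmax = right, ymin = 0, ymax = res$counts)
print(bins)
```

Note that when `breaks` is a single number, as in the function below, hist() treats it only as a suggestion and picks "pretty" break points, so the number of bars may differ slightly from the requested value.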
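quick_hist = function(values_vec, breaks=50) {
  # Let base R do the binning; plot=FALSE returns breaks and counts without drawing
  res = hist(values_vec, plot=FALSE, breaks=breaks)
  # One rectangle per bin: left edge, right edge, and bar height
  dat = data.frame(xmin=head(res$breaks, -1L),
                   xmax=tail(res$breaks, -1L),
                   ymin=0.0,
                   ymax=res$counts)
  # With ggplot2 >= 3.4.0, use linewidth=0.5 instead of size=0.5
  ggplot(dat, aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax)) +
    geom_rect(size=0.5, colour="grey30", fill="grey80")
}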
ggsave("quick_hist.png",
       plot=quick_hist(mtx1) + theme_bw(),
       width=8, height=4, dpi=150)
print(microbenchmark(hist(mtx1),
                     g_hist(df1),
                     print(quick_hist(mtx1, breaks=30)),
                     times=5L), signif=3)
# Unit: milliseconds
#                                 expr  min   lq mean median   uq  max neval
#                           hist(mtx1)  264  270  305    298  332  359     5
#                          g_hist(df1) 5740 5760 6180   5770 5920 7700     5
# print(quick_hist(mtx1, breaks = 30))  407  418  440    433  440  503     5