
Plotting histogram of a big matrix in ggplot2 is 20x slower than base hist()

Tags: r, ggplot2

I have a numeric matrix with about 10M values and just need to show the distribution of the values in a histogram. In base R, hist() does this quite fast, but ggplot is much slower (I also have to melt the matrix first, but that is not the time-limiting step). Is there any way to make it fast with ggplot?

require(microbenchmark)
require(ggplot2)


mtx1 <- matrix(rnorm(6e4*150), nrow = 6e4)
df1 <- reshape2::melt(mtx1)

g_hist <- function(df){
  print(ggplot(df, aes(x=value)) + geom_histogram(bins=30))
}

print(microbenchmark(
  hist(mtx1),
  g_hist(df1),
  times=3L
), signif=3)


# Unit: milliseconds
#        expr  min   lq mean median   uq  max neval
#  hist(mtx1)  384  471  530    559  603  647     3
# g_hist(df1) 7710 8000 8190   8300 8440 8570     3
Vasily A asked Jun 15 '19 03:06




1 Answer

Here is a solution in which the histogram bins and bin counts are calculated with the base R hist() function. (Computing the bins does indeed appear to be the source of the bottleneck in geom_histogram().)

Then I use the computed bin counts and bin boundaries along with geom_rect() to draw a histogram that looks pretty much identical to those produced by geom_histogram().

This is still slower than base hist(), but by about 1.5-fold rather than 20-fold.

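quick_hist = function(values_vec, breaks=50) {
    # Bin with base hist(); plot=FALSE returns breaks and counts without drawing
    res = hist(values_vec, plot=FALSE, breaks=breaks)

    # One row per bin: left edge, right edge, and bar height
    dat = data.frame(xmin=head(res$breaks, -1L),
                     xmax=tail(res$breaks, -1L),
                     ymin=0.0,
                     ymax=res$counts)

    # Draw each bin as a rectangle
    # (ggplot2 >= 3.4.0 prefers linewidth= over size= for the outline)
    ggplot(dat, aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax)) +
        geom_rect(size=0.5, colour="grey30", fill="grey80")
}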


ggsave("quick_hist.png", 
       plot=quick_hist(mtx1) + theme_bw(), 
       width=8, height=4, dpi=150)
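
One caveat with quick_hist(): when breaks is a single number, hist() treats it only as a suggestion and chooses "pretty" break points, so breaks=30 may not give exactly 30 bins. A minimal sketch, assuming an exact bin count is wanted, is to pass an explicit vector of break points instead. The helper name exact_breaks is hypothetical and not part of the answer above, and the resulting bin edges are not claimed to match those geom_histogram(bins=30) would choose.

```r
# Hypothetical helper: n_bins equal-width bins spanning the range of the data.
exact_breaks <- function(values_vec, n_bins = 30) {
  rng <- range(values_vec)
  # n_bins bins need n_bins + 1 boundaries
  seq(rng[1], rng[2], length.out = n_bins + 1)
}

set.seed(1)
x <- rnorm(1e5)

# With a vector of breaks, hist() uses exactly those boundaries
res <- hist(x, breaks = exact_breaks(x, 30), plot = FALSE)

length(res$counts)  # number of bins
sum(res$counts)     # total number of values that were binned
```

The same vector can be passed straight through, e.g. quick_hist(mtx1, breaks=exact_breaks(mtx1, 30)), because quick_hist() forwards breaks to hist() unchanged.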


print(microbenchmark(hist(mtx1), 
                     g_hist(df1), 
                     print(quick_hist(mtx1, breaks=30)),
                     times=5L), signif=3)

# Unit: milliseconds
#                                  expr  min   lq mean median   uq  max neval
#                            hist(mtx1)  264  270  305    298  332  359     5
#                           g_hist(df1) 5740 5760 6180   5770 5920 7700     5
#  print(quick_hist(mtx1, breaks = 30))  407  418  440    433  440  503     5
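
Because the precomputed table has only one row per bin, the same approach can feed a density scale instead of counts. A minimal sketch, assuming the binning is kept in a separate helper (hist_to_df is a hypothetical name, not from the answer above) that returns both the counts and the densities reported by hist(). The plotting call is guarded so the helper itself runs without ggplot2 installed.

```r
# Hypothetical helper: bin with base hist() and return one row per bin.
hist_to_df <- function(values_vec, breaks = 30) {
  res <- hist(values_vec, plot = FALSE, breaks = breaks)
  data.frame(xmin    = head(res$breaks, -1L),
             xmax    = tail(res$breaks, -1L),
             count   = res$counts,
             density = res$density)
}

set.seed(1)
bins <- hist_to_df(rnorm(1e5))

# Plot densities instead of counts; only nrow(bins) rectangles are drawn.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(bins, aes(xmin = xmin, xmax = xmax, ymin = 0, ymax = density)) +
    geom_rect(colour = "grey30", fill = "grey80")
}
```

Keeping the binning separate from the drawing also makes it easy to inspect or reuse the bin table, for example to check that the counts add up to the number of input values.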

bdemarest answered Nov 11 '22 20:11
