Plotting histogram of a big matrix in ggplot2 is 20x slower than base hist()

Tags: r, ggplot2

I have a numeric matrix with about 10M values and just need to show the distribution of its values in a histogram. In base R, hist() does this quite fast, but with ggplot2 it is much slower (I also have to melt the matrix first, but that is not the time-limiting step). Is there any way to make this fast in ggplot2?

library(microbenchmark)
library(ggplot2)


mtx1 <- matrix(rnorm(6e4*150), nrow = 6e4)
df1 <- reshape2::melt(mtx1)

g_hist <- function(df){
  print(ggplot(df, aes(x=value)) + geom_histogram(bins=30))
}

print(microbenchmark(
  hist(mtx1),
  g_hist(df1),
  times=3L
), signif=3)


# Unit: milliseconds
#        expr  min   lq mean median   uq  max neval
#  hist(mtx1)  384  471  530    559  603  647     3
# g_hist(df1) 7710 8000 8190   8300 8440 8570     3


asked Jun 15 '19 03:06

Vasily A

1 Answer

Here is a solution in which the histogram bins and bin counts are calculated using the base R hist() function. (Computing the bins does indeed appear to be the source of the bottleneck in geom_histogram().)
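As a rough way to check this on your own machine, one could time the binning step of each approach without drawing anything. The sketch below is only an assumption about how to isolate that step: it uses ggplot_build() to run ggplot2's stat computation and hist(plot=FALSE) for base R, on a smaller simulated vector than the matrix in the question. Timings will vary by machine and ggplot2 version.

```r
library(ggplot2)

set.seed(1)
x <- rnorm(1e6)  # assumed test data, smaller than the question's 9M values

p <- ggplot(data.frame(value = x), aes(x = value)) +
  geom_histogram(bins = 30)

# Run ggplot2's data processing (including the binning) without rendering
t_build <- system.time(built <- ggplot_build(p))

# Run base R's binning alone, also without rendering
t_hist <- system.time(res <- hist(x, breaks = 30, plot = FALSE))

cat("ggplot_build():", t_build[["elapsed"]], "s\n")
cat("hist(plot=FALSE):", t_hist[["elapsed"]], "s\n")
```

If most of the ggplot2 time shows up in ggplot_build() rather than in printing the plot, that supports the idea that the binning, not the drawing, is the slow part.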

I then use the computed bin counts and bin boundaries with geom_rect() to draw a histogram that looks nearly identical to the one produced by geom_histogram().

The required time is still greater than for base hist(), but by about 1.5-fold rather than 20-fold.

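quick_hist = function(values_vec, breaks=50) {
    # Let base hist() do the binning only; for a single number, `breaks` is a
    # suggestion that hist() rounds to "pretty" break points
    res = hist(values_vec, plot=FALSE, breaks=breaks)

    # One rectangle per bin: left/right edges from the breaks, height from the counts
    dat = data.frame(xmin=head(res$breaks, -1L),
                     xmax=tail(res$breaks, -1L),
                     ymin=0.0,
                     ymax=res$counts)

    # `linewidth` needs ggplot2 >= 3.4.0; on older versions use `size` instead
    ggplot(dat, aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax)) +
        geom_rect(linewidth=0.5, colour="grey30", fill="grey80")
}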


ggsave("quick_hist.png", 
       plot=quick_hist(mtx1) + theme_bw(), 
       width=8, height=4, dpi=150)


print(microbenchmark(hist(mtx1), 
                     g_hist(df1), 
                     print(quick_hist(mtx1, breaks=30)),
                     times=5L), signif=3)

# Unit: milliseconds
#                                  expr  min   lq mean median   uq  max neval
#                            hist(mtx1)  264  270  305    298  332  359     5
#                           g_hist(df1) 5740 5760 6180   5770 5920 7700     5
#  print(quick_hist(mtx1, breaks = 30))  407  418  440    433  440  503     5
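One caveat: when hist() is given a single number for breaks, it treats it as a suggestion and picks "pretty" break points, so quick_hist(mtx1, breaks=30) will not necessarily draw 30 bars. If an exact bin count matters, a variant along the following lines might help. The name quick_hist_exact and its bins argument are hypothetical, and the equal-width binning here is not guaranteed to match the bin placement of geom_histogram(bins=30).

```r
library(ggplot2)

# Hypothetical variant of quick_hist() that forces exactly `bins` equal-width bins
quick_hist_exact = function(values_vec, bins=30) {
    # Drop NA/Inf so that range() and hist() see the same finite values
    values_vec = values_vec[is.finite(values_vec)]

    # Explicit break points spanning the data range give exactly `bins` cells
    rng = range(values_vec)
    brks = seq(rng[1], rng[2], length.out=bins + 1L)
    res = hist(values_vec, plot=FALSE, breaks=brks, include.lowest=TRUE)

    dat = data.frame(xmin=head(res$breaks, -1L),
                     xmax=tail(res$breaks, -1L),
                     ymin=0.0,
                     ymax=res$counts)

    ggplot(dat, aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax)) +
        geom_rect(colour="grey30", fill="grey80")
}

set.seed(1)
p_exact = quick_hist_exact(matrix(rnorm(1e4), nrow=100), bins=30)
```

Calling print(p_exact) draws the plot; the binning cost should be comparable to quick_hist() since it still goes through hist(plot=FALSE).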


answered Nov 11 '22 20:11

bdemarest
