Efficiently plotting millions of data points in R

Tags: plot, r, ggplot2

I'm trying to plot a few million data points in R. I'm currently using ggplot2, but I'm open to suggestions of alternative packages. The problem is that the graph takes too long to render, often upwards of a minute. I'm looking for ways to do this faster, ideally in real time. I would appreciate any help; the code is attached below for clarity.

Creating a (random) data frame with 100,000 rows and five columns (~500,000 values):

letters <- c("A", "B", "C", "D", "E", "F", "G")  # note: masks the built-in `letters`
myLetters <- sample(x = letters, size = 100000, replace = TRUE)
direction <- c("x", "y", "z")
factor1 <- sample(x = direction, size = 100000, replace = TRUE)
factor2 <- runif(100000, 0, 20)
factor3 <- runif(100000, 0, 100)
decile <- sample(x = 1:10, size = 100000, replace = TRUE)

# one row per point: two categorical columns, two numeric columns, and a decile
new.plot.df <- data.frame(letters = myLetters, factor1 = factor1, factor2 = factor2,
                          factor3 = factor3, decile = decile)

Now, plotting the data:

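library(ggplot2)

# scatter plot faceted into a decile-by-letter grid
color.plot <- ggplot(new.plot.df, aes(x = factor3, y = factor2, color = factor1)) +
  geom_point(aes(alpha = factor2)) +
  facet_grid(decile ~ letters)
print(color.plot)  # rendering happens here, not at construction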


How do I make the rendering faster?

Karan Tibrewal asked Jan 20 '16 14:01



2 Answers

There are two main sources of slowness in R plotting:

  1. the graphics device and the backend in general
  2. plotting too many complicated shapes

The graphical backend can be changed with the appropriate device-opening and backend-selection commands. For me, this usually helps:

options(bitmapType='cairo')  #set the drawing backend, this may speed up PNG rendering
x11(type='cairo')   #drawing to X11 window using cairo is the fastest interactive output for me

(X11 is not available on Windows and is a little confusing in RStudio, but that's a different story.)

Plotting simpler shapes helps quite a lot. ggplot2 draws points with shape 19 by default (the equivalent of pch=19), which is way too slow because of anti-aliasing. You can usually get about 10x faster rendering by using pch='.' (which is just a single non-aliased pixel) or pch=16 (which is a small non-aliased circle). The same applies to ggplot with shape='.' and shape=16, respectively. If you have a lot of points and set an appropriately low alpha, you'll get the "anti-aliasing" for free.
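As a rough illustration of the shape advice above, here is a minimal sketch using a synthetic data frame (`df` and `fast.plot` are hypothetical names, and the alpha value of 0.3 is an arbitrary choice rather than a recommendation). It swaps the default point shape for the single-pixel `"."` and uses one constant alpha instead of mapping alpha to a variable:

```r
library(ggplot2)

# Synthetic stand-in for the data frame in the question (hypothetical names)
set.seed(1)
n <- 100000
df <- data.frame(
  factor1 = sample(c("x", "y", "z"), n, replace = TRUE),
  factor2 = runif(n, 0, 20),
  factor3 = runif(n, 0, 100)
)

# shape = "." draws each point as a single pixel; a constant low alpha lets
# overlapping points build up visible density
fast.plot <- ggplot(df, aes(x = factor3, y = factor2, color = factor1)) +
  geom_point(shape = ".", alpha = 0.3)

print(fast.plot)
```

Wrapping `print(fast.plot)` in `system.time()` before and after the change is a simple way to check whether it helps on a particular device.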

For me, just switching the graphical backend and setting different point shape improved drawing of 1 million points from around 30 minutes to seconds. 500k data points should be rendered in under a second.

EDIT (Jan 2020): I recently made a library that speeds this up even more: https://github.com/exaexa/scattermore
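A minimal sketch of what using that library might look like, assuming the scattermore package is installed and that its `geom_scattermore()` layer accepts the `pointsize` and `pixels` arguments as described in the package documentation. The data here is synthetic and the parameter values are arbitrary:

```r
library(ggplot2)
library(scattermore)  # assumed installed, e.g. via install.packages("scattermore")

# Synthetic data: one million points (hypothetical example, not from the question)
set.seed(1)
n <- 1000000
df <- data.frame(x = rnorm(n), y = rnorm(n))

# geom_scattermore rasterizes the points into a fixed-size bitmap
# instead of drawing every point as a separate shape
raster.plot <- ggplot(df, aes(x = x, y = y)) +
  geom_scattermore(pointsize = 0, alpha = 0.1, pixels = c(512, 512))

print(raster.plot)
```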

exa answered Oct 17 '22 06:10


In general, there are two strategies that I use for this:

1) As described in the comments, plotting a reasonably representative sample of your data will barely change how the plot looks, and it reduces the number of points to render.

2) One trick that I use is to create the plot object without displaying it and save it straight to a PNG image instead. This speeds up the process a lot, because the image you then open is a raster rather than a vector image.
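The two strategies can be sketched together as follows. This is only an illustration on synthetic data: the sample size of 10,000, the output file name, and the image dimensions are arbitrary assumptions, and `df`, `plot.sample`, and `out.file` are hypothetical names.

```r
library(ggplot2)

# Synthetic stand-in for the data frame in the question (hypothetical names)
set.seed(1)
n <- 100000
df <- data.frame(
  factor1 = sample(c("x", "y", "z"), n, replace = TRUE),
  factor2 = runif(n, 0, 20),
  factor3 = runif(n, 0, 100)
)

# Strategy 1: plot a random subsample instead of every row
sample.size <- 10000  # arbitrary; choose what still represents the data
plot.sample <- df[sample(nrow(df), sample.size), ]

p <- ggplot(plot.sample, aes(x = factor3, y = factor2, color = factor1)) +
  geom_point(alpha = 0.3)

# Strategy 2: write the plot directly to a PNG file without displaying it
out.file <- file.path(tempdir(), "color_plot.png")
ggsave(out.file, plot = p, width = 10, height = 8, dpi = 150)
```

A simple random sample keeps the overall shape of the point cloud but can thin out rare groups, so stratified sampling may be worth considering when some facets hold few points.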

nbafrank answered Oct 17 '22 05:10