I'm trying to plot a few million data points in R. I'm currently using ggplot2 (but I'm open to suggestions of alternative packages). The problem is that the graph takes too long to render (often upwards of a minute). I'm looking for ways to do this faster -- ideally in real time. I'd appreciate any help -- I'm attaching code for clarity.
Creating a random data frame with 100,000 rows:
letters <- c("A", "B", "C", "D", "E", "F", "G")
myLetters <- sample(x = letters, size = 100000, replace = TRUE)
direction <- c("x", "y", "z")
factor1 <- sample(x = direction, size = 100000, replace = TRUE)
factor2 <- runif(100000, 0, 20)
factor3 <- runif(100000, 0, 100)
decile <- sample(x = 1:10, size = 100000, replace = TRUE)
new.plot.df <- data.frame(letters = myLetters, factor1 = factor1, factor2 = factor2,
                          factor3 = factor3, decile = decile)
Now, plotting the data:
library(ggplot2)
color.plot <- ggplot(new.plot.df, aes(x = factor3, y = factor2, color = factor1)) +
  geom_point(aes(alpha = factor2)) +
  facet_grid(decile ~ letters)
color.plot
How do I make the rendering faster?
As of 2022, the best solution is to use DuckDB (there is an R connector): it lets you query very large datasets (CSV, Parquet, among others) and provides many functions for computing summary statistics. The idea is to compute those statistics in DuckDB, load only the summaries into R/Python/Julia, and plot those.
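A minimal sketch of that workflow, assuming the duckdb and DBI packages are installed; the file points.csv and its columns grp and y are placeholders generated here to stand in for your own large dataset:

```r
library(DBI)
library(duckdb)

# Stand-in for a large CSV on disk (hypothetical columns grp, y):
write.csv(data.frame(grp = sample(c("a", "b"), 1e5, replace = TRUE),
                     y   = runif(1e5)),
          "points.csv", row.names = FALSE)

con <- dbConnect(duckdb::duckdb())

# Aggregate inside DuckDB so only the summary rows reach R:
summary.df <- dbGetQuery(con, "
  SELECT grp, avg(y) AS mean_y, count(*) AS n
  FROM read_csv_auto('points.csv')
  GROUP BY grp
")
dbDisconnect(con, shutdown = TRUE)

# summary.df now has one row per group, so plotting it is instant, e.g.
# ggplot(summary.df, aes(grp, mean_y)) + geom_col()
```

The key point is that the raw points never enter an R data frame; only the handful of summary rows do.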
There are two main sources of slowness in R plotting:
The graphical back-end can be changed with the appropriate device-opening and backend-selection commands -- for me, this usually helps:
options(bitmapType='cairo') #set the drawing backend, this may speed up PNG rendering
x11(type='cairo') #drawing to X11 window using cairo is the fastest interactive output for me
(X11 is not available on Windows and is a little confusing in RStudio, but that's a different story.)
Plotting simpler shapes helps quite a lot. ggplot2 uses some variant of pch=19 or pch=20 by default, which are very slow to draw because of anti-aliasing. You can usually get about 10x faster rendering with pch='.' (a single non-anti-aliased pixel) or pch=16 (a small non-anti-aliased circle). The same applies to ggplot2 via shape='.' and shape=16, respectively. If you have a lot of points and set an appropriately low alpha, you get the "anti-aliasing" for free.
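As a concrete sketch of that advice (ggplot2 assumed installed; the data frame mirrors the one built in the question):

```r
library(ggplot2)

n <- 100000
new.plot.df <- data.frame(
  letters = sample(c("A", "B", "C", "D", "E", "F", "G"), n, replace = TRUE),
  factor1 = sample(c("x", "y", "z"), n, replace = TRUE),
  factor2 = runif(n, 0, 20),
  factor3 = runif(n, 0, 100),
  decile  = sample(1:10, n, replace = TRUE)
)

# shape = '.' draws one non-anti-aliased pixel per point; a fixed low
# alpha gives a smooth, anti-aliased look essentially for free.
fast.plot <- ggplot(new.plot.df, aes(x = factor3, y = factor2, color = factor1)) +
  geom_point(shape = '.', alpha = 0.3) +
  facet_grid(decile ~ letters)
```

Printing fast.plot renders it; the speedup comes entirely from the cheaper point glyph, so the rest of the plot specification is unchanged.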
For me, just switching the graphical backend and using a different point shape cut the drawing time for 1 million points from around 30 minutes to a few seconds; 500k data points should render in under a second.
EDIT (Jan 2020): I recently made a library that speeds this up even more: https://github.com/exaexa/scattermore
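For completeness, a sketch using scattermore's ggplot2 geom (assuming the scattermore and ggplot2 packages are installed; the random data is illustrative):

```r
library(ggplot2)
library(scattermore)

xy <- data.frame(x = rnorm(1e6), y = rnorm(1e6))

# geom_scattermore() rasterizes the points into a bitmap before handing
# them to the graphics device, so a million points draw almost instantly.
p <- ggplot(xy, aes(x, y)) +
  geom_scattermore(pointsize = 1)
```

It slots into an existing ggplot2 pipeline as a near drop-in replacement for geom_point().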
In general there are two strategies that I use for this:
1) As described in the comments, taking a reasonably representative sample of your data will barely change the look of the plot while greatly reducing the number of points to render.
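For instance, keeping a random 10% of the rows with base R (the fraction is a judgment call for your data; the small data frame here stands in for the question's new.plot.df):

```r
# Stand-in for the question's data frame:
new.plot.df <- data.frame(factor2 = runif(1e5, 0, 20),
                          factor3 = runif(1e5, 0, 100))

# Keep a random 10% of the rows; for most scatter plots the overall
# shape of the cloud is preserved, but rendering is ~10x cheaper.
keep <- sample(nrow(new.plot.df), size = round(0.1 * nrow(new.plot.df)))
sampled.df <- new.plot.df[keep, ]
```

You then pass sampled.df to ggplot() in place of the full data frame.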
2) One trick I use is to create the plot object without displaying it and save it straight to a PNG file instead. This speeds things up considerably, because when you open the image it is a raster rather than a vector image.
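A sketch of that approach with ggsave() (ggplot2 assumed installed; the data, file name, and dimensions here are illustrative):

```r
library(ggplot2)

df <- data.frame(x = runif(1e5), y = runif(1e5))  # stand-in data
color.plot <- ggplot(df, aes(x, y)) +
  geom_point(shape = '.')

# Render straight to a raster file instead of an on-screen device;
# opening the finished PNG is instant, since it is pixels, not vectors.
ggsave("color_plot.png", color.plot, width = 8, height = 6, dpi = 150)
```

The expensive rendering happens once, off-screen, and the resulting file can be viewed or embedded without re-drawing the points.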