
Wrapping R's plot function (or ggplot2) to prevent plotting of large data sets

Rather than ask how to plot big data sets, I want to wrap plot so that code that produces a lot of plots doesn't get hammered when it is plotting a large object. How can I wrap plot in a very simple manner so that all of its functionality is preserved, but so that it first tests whether the object being passed is too large?

This code works for very vanilla calls to plot, but it lacks the generality of plot (see below).

myPlot <- function(x, ...){
    isBad <- any( (length(x) > 10^6) || (object.size(x) > 8*10^6) || (nrow(x) > 10^6) )
    if(is.na(isBad)){isBad = FALSE}
    if(isBad){
        stop("No plots for you!")
    }
    return(plot(x, ...))
}

x = rnorm(1000)
x = rnorm(10^6 + 1)

myPlot(x)

An example where this fails:

x = rnorm(1000)
y = rnorm(1000)
plot(y ~ x)
myPlot(y ~ x)

Is there some easy way to wrap plot to enable this checking of the data to be plotted, while still passing through all of the arguments? If not, then how about ggplot2? I'm an equal opportunity non-plotter. (In the cases where the dataset is large, I will use hexbin, sub-sampling, density plots, etc., but that's not the focus here.)


Note 1: When testing ideas, I recommend testing for size > 100 (or setting a variable, e.g. myThreshold <- 1000) rather than testing against a size of > 1M; otherwise there will be a lot of pain in hitting the slow plotting. :)

Asked by Iterator on Oct 15 '11 at 17:10




1 Answer

The problem is that, as currently coded, myPlot() assumes x is a data object, but you then pass it a formula. R's plot() handles this via S3 methods: when x is a formula, the plot.formula() method is dispatched instead of the basic plot.default() method.
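To see what the dispatch keys on, here is a quick check (a small sketch; the name f is only illustrative). The formula method is registered for plot() but is not exported under its own name, so it is looked up with getS3method():

```r
# A formula can be created without y and x existing yet
f <- y ~ x

class(f)                  # "formula"
inherits(f, "formula")    # TRUE

# plot() has a method registered for class "formula"
is.function(getS3method("plot", "formula"))    # TRUE
```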

You need to do the same:

myplot <- function(x, ...)
    UseMethod("myplot")

myplot.default <- function(x, ...) {
    ## nrow(x) is NULL for plain vectors, so the comparison has length zero;
    ## isTRUE() turns that (and any NA) into FALSE
    isBad <- isTRUE(length(x) > 10^6) ||
             isTRUE(object.size(x) > 8*10^6) ||
             isTRUE(nrow(x) > 10^6)
    if (isBad) {
        stop("No plots for you!")
    }
    invisible(plot(x, ...))
}

myplot.formula <- function(x, ...) {
    ## code here to process the formula into a data object for plotting
    ....
    myplot.default(processed_x, ...)
}

You can steal code from plot.formula() to use in the code needed to process x into an object. Alternatively, you can roll your own following the standard non-standard evaluation rules (PDF).
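As one possible way to fill in the formula method, here is a minimal sketch that uses model.frame() to pull the variables out of the formula, rather than code copied from plot.formula(). The helper tooBig(), the small myThreshold limit (borrowed from the question's testing note), and the data argument are assumptions made for illustration, and only lengths and row counts are checked, not object.size(). Instead of building a processed object, it checks the model frame and then hands the original formula back to plot():

```r
myThreshold <- 1000   # assumed small limit, so that testing stays fast

myplot <- function(x, ...) UseMethod("myplot")

# Hypothetical helper: TRUE if x has more elements or rows than the limit.
# isTRUE() copes with nrow(x) being NULL for plain vectors.
tooBig <- function(x, limit = myThreshold) {
    isTRUE(length(x) > limit) || isTRUE(nrow(x) > limit)
}

myplot.default <- function(x, ...) {
    if (tooBig(x)) stop("No plots for you!")
    invisible(plot(x, ...))
}

myplot.formula <- function(x, data = parent.frame(), ...) {
    force(data)   # pin down the caller's environment (or the supplied data)
    # Evaluate the formula's variables so the real data can be measured
    mf <- model.frame(x, data = data)
    if (tooBig(mf)) stop("No plots for you!")
    # Pass the untouched formula on; plot() dispatches to its formula method
    invisible(plot(x, data = data, ...))
}
```

With this in place, myplot(y ~ x) on two short vectors should plot as plot(y ~ x) does, while the same call on vectors longer than myThreshold should stop with the error before any plotting happens.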

Answered by Gavin Simpson on Sep 29 '22 at 23:09