
Plotting huge data files in R?

Tags: r, ggplot2

I have an input file with about 20 million lines; the file is about 1.2 GB. Is there any way I can plot the data in R? Some of the columns are categorical, but most are numeric.

I have tried my plotting script on a small subset of the input file (about 800K lines), but even though I have about 8 GB of RAM, I don't seem to be able to plot all the data. Is there any simple way to do this?

Sam asked May 29 '12

2 Answers

Without a clearer description of the kind of plot you want, it is hard to give concrete suggestions. In general, however, there is no need to plot 20 million points. For example, a time series could be represented by a spline fit, or by some kind of average, e.g. aggregating hourly data to daily averages. Alternatively, you could draw a subset of the data, e.g. only one point per day in the time-series example. So I think your challenge is not so much getting 20M points (or even 800k) onto a plot, but aggregating your data effectively so that it conveys the message you want to tell.
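A minimal sketch of the aggregation idea, using a hypothetical data frame with a timestamp and a numeric value (the real columns would come from your file):

```r
library(ggplot2)

# Hypothetical stand-in for the real data: hourly observations over 90 days.
set.seed(1)
df <- data.frame(
  time  = seq(as.POSIXct("2012-01-01", tz = "UTC"), by = "hour",
              length.out = 24 * 90),
  value = cumsum(rnorm(24 * 90))
)

# Aggregate hourly observations to daily means before plotting,
# so ggplot2 only has to draw ~90 points instead of ~2160.
df$day <- as.Date(df$time, tz = "UTC")
daily  <- aggregate(value ~ day, data = df, FUN = mean)

ggplot(daily, aes(day, value)) + geom_line()
```

The same pattern scales to the 20M-row case: read the file once, collapse it with `aggregate()` (or `dplyr::summarise()`), and hand only the reduced data frame to the plotting code.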

Paul Hiemstra answered Sep 27 '22

The package hexbin, which plots hexagonal bins instead of scatterplots for pairs of variables (as suggested by Ben Bolker in Speed up plot() function for large dataset), worked fairly well for me on 2 million records with 4 GB of RAM. But it failed on 200 million records/rows for the same set of variables. I tried reducing the bin size to trade computation time against RAM usage, but it did not help.

For 20 million records, you can try hexbins with xbins = 20, 30, or 40 to start with.
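A minimal sketch of the hexbin approach, assuming two hypothetical numeric columns `x` and `y` (in practice these would be read from the 20M-row file):

```r
library(hexbin)

# Hypothetical stand-in data; a smaller n keeps the sketch quick to run.
set.seed(42)
n <- 1e5
x <- rnorm(n)
y <- x + rnorm(n, sd = 0.5)

# Bin the (x, y) pairs into hexagons; xbins controls the resolution.
# The plotted object's size depends on the number of bins, not rows.
hb <- hexbin(x, y, xbins = 30)
plot(hb)
```

Since the question is tagged ggplot2, the equivalent there is `ggplot(data, aes(x, y)) + geom_hex(bins = 30)`, which uses the same hexbin machinery under the hood.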

KarthikS answered Sep 27 '22