How can I plot a very large data set in R?
I'd like to use a boxplot, violin plot, or something similar. The data cannot all fit in memory at once. Can I incrementally read it in and compute the summaries needed to make these plots? If so, how?
As of 2022, the best solution is to use DuckDB (there is an R connector): it lets you query very large datasets (CSV, Parquet, among others) and comes with many functions for computing summary statistics. The idea is to have DuckDB compute those statistics, load the (small) results into R/Python/Julia, and plot them.
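For illustration, here is a minimal sketch of that workflow; the file name measurements.parquet and the columns grp and value are hypothetical stand-ins for your own data:

library(DBI)
library(duckdb)
library(ggplot2)

# Connect to an in-process DuckDB; the Parquet file is scanned out of core,
# so the raw data never has to fit into R's memory.
con <- dbConnect(duckdb())
stats <- dbGetQuery(con, "
  SELECT grp,
         min(value)                 AS ymin,
         quantile_cont(value, 0.25) AS lower,
         quantile_cont(value, 0.50) AS middle,
         quantile_cont(value, 0.75) AS upper,
         max(value)                 AS ymax
  FROM read_parquet('measurements.parquet')
  GROUP BY grp
")
dbDisconnect(con, shutdown = TRUE)

# Feed the precomputed five-number summaries straight into a boxplot layer.
ggplot(stats, aes(x = grp, ymin = ymin, lower = lower, middle = middle,
                  upper = upper, ymax = ymax)) +
  geom_boxplot(stat = "identity")

The same pattern works for violin plots if you ask DuckDB for a finer grid of quantiles per group instead of just five.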
One of the easiest ways to deal with Big Data in R is simply to increase the machine's memory. Today, R can address 8 TB of RAM if it runs on 64-bit machines. That is in many situations a sufficient improvement compared to about 2 GB addressable RAM on 32-bit machines.
Scatter plots are best for showing distribution in large data sets.
As a supplement to my comment on Dmitri's answer, here is a function to calculate quantiles using the ff big-data handling package:
ffquantile <- function(ffv, qs = c(0, 0.25, 0.5, 0.75, 1), ...) {
  stopifnot(all(qs <= 1 & qs >= 0))
  # sort the ff vector on disk, then pick the order statistics around each quantile
  ffvs <- ffsort(ffv, ...)
  j  <- qs * (length(ffv) - 1) + 1
  jf <- floor(j)
  jc <- ceiling(j)
  # average the two neighbouring order statistics when j falls between indices
  rowSums(matrix(ffvs[c(jf, jc)], length(qs), 2)) / 2
}
This is an exact algorithm, so it relies on sorting the whole vector and may therefore take a lot of time.
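A minimal usage sketch (the data here are simulated in memory purely for illustration; in practice the ff vector would come from something like read.csv.ffdf):

library(ff)
x <- ff(rnorm(1e6))    # stand-in for a file-backed vector that does not fit in RAM
q <- ffquantile(x)     # min, Q1, median, Q3, max
# bxp() draws a boxplot directly from a five-number summary,
# so the raw data never has to be loaded again.
bxp(list(stats = matrix(q, 5, 1), n = length(x), names = "x"))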