
Plotting of very large data sets in R

How can I plot a very large data set in R?

I'd like to use a boxplot, violin plot, or something similar. The data cannot all fit in memory. Can I incrementally read it in and compute the summaries needed to make these plots? If so, how?

asked Dec 02 '10 by Daniel Arndt


People also ask

How do I plot a large data in R?

As of 2022, a strong solution is DuckDB (which has an R connector): it lets you query very large datasets (CSV, Parquet, among others) and comes with many functions for computing summary statistics. The idea is to use DuckDB to compute those statistics out of core, load only the statistics into R/Python/Julia, and plot them.
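A minimal sketch of that workflow in R, assuming a hypothetical file big.csv with a numeric column x (quantile_cont, median, and read_csv_auto are built-in DuckDB functions):

    library(DBI)
    library(duckdb)

    con <- dbConnect(duckdb::duckdb())
    # DuckDB scans the CSV out of core; only the one-row summary enters R
    five <- dbGetQuery(con, "
      SELECT min(x)                AS lo,
             quantile_cont(x, 0.25) AS q1,
             median(x)             AS med,
             quantile_cont(x, 0.75) AS q3,
             max(x)                AS hi
      FROM read_csv_auto('big.csv')
    ")
    dbDisconnect(con, shutdown = TRUE)

The resulting one-row data frame is exactly the five-number summary a boxplot needs, so the raw data never has to fit in memory.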

How do you handle a large data set in R?

One of the easiest ways to deal with big data in R is simply to increase the machine's memory. Today, R can address 8 TB of RAM when running on a 64-bit machine, which in many situations is a sufficient improvement over the roughly 2 GB of addressable RAM on 32-bit machines.

Which graph is best for large data sets?

Scatter plots are well suited to showing the distribution of large data sets, provided overplotting is handled, for example with density shading or hexagonal binning.
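For example, base R's smoothScatter() replaces individual points with a density shade, which stays readable at millions of points (the data below is synthetic, just for illustration):

    # one million correlated points, rendered as a smoothed density
    x <- rnorm(1e6)
    y <- x + rnorm(1e6)
    smoothScatter(x, y, main = "1e6 points as a density shade")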


1 Answer

As a supplement to my comment on Dmitri's answer, here is a function that calculates quantiles using the ff big-data handling package:

    library(ff)

    ffquantile <- function(ffv, qs = c(0, 0.25, 0.5, 0.75, 1), ...) {
      stopifnot(all(qs <= 1 & qs >= 0))
      # sort the ff vector on disk; the data never has to fit in RAM
      ffvs <- ffsort(ffv, ...)
      # fractional positions of the requested quantiles in the sorted vector
      j <- qs * (length(ffv) - 1) + 1
      jf <- floor(j)
      jc <- ceiling(j)
      # average the order statistics just below and just above each position
      rowSums(matrix(ffvs[c(jf, jc)], length(qs), 2)) / 2
    }

This is an exact algorithm, so it relies on sorting the full vector -- and thus may take a lot of time.
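A sketch of how the result could drive a plot, using base R's bxp(), which draws a boxplot directly from a matrix of summary statistics; the ff vector below is a made-up example:

    # hypothetical usage: boxplot from the five-number summary alone
    library(ff)
    x <- ff(rnorm(1e7))          # example ff vector, stored on disk
    five <- ffquantile(x)        # min, Q1, median, Q3, max
    bxp(list(stats = matrix(five, nrow = 5, ncol = 1),
             n     = length(x),
             names = "x"))

Because bxp() only needs the five quantiles, the raw data stays on disk during plotting.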

answered Oct 11 '22 by mbq