Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reduce PDF file size of plots by filtering hidden objects

While producing scatter plots of many points in R (using ggplot() for example), there might be many points that are behind the others and not visible at all. For instance see the plot below:

enter image description here

This is a scatter plot of several hundreds of thousands points, but most of them are behind the other points. The problem is when casting the output to a vector file (a PDF file for example), the invisible points make the file size so big, and increase memory and cpu usage while viewing the file.

A simple solution is to cast the output to a bitmap picture (TIFF or PNG for example), but they lose the vector quality and can be even larger in size. I tried some online PDF compressors, but the result was the same size as my original file.

Is there any good solution? For example some way to filter the points that are not visible, possibly during generating plot or after it by editing PDF file?

like image 881
Ali Avatar asked May 21 '13 08:05

Ali


2 Answers

As a start you can do something like this:

set.seed(42)
DF <- data.frame(x=x<-runif(1e6),y=x+rnorm(1e6,sd=0.1))
plot(y~x,data=DF,pch=".",cex=4)

enter image description here

PDF size: 6334 KB

DF2 <- data.frame(x=round(DF$x,3),y=round(DF$y,3))
DF2 <- DF[!duplicated(DF2),]
nrow(DF2)
#[1] 373429
plot(y~x,data=DF2,pch=".",cex=4)

enter image description here

PDF size: 2373 KB

With the rounding you can control how many values you want to remove. You only need to modify this to handle the different colours.

like image 194
Roland Avatar answered Oct 21 '22 15:10

Roland


Simply saving the plot as a high-res png file will very drastically cut the size, while keeping the quality more than good enough. At least I've never had journals complain about any of the png's I sent them, just keep sure to use > 600 dpi.

like image 45
Paul Hiemstra Avatar answered Oct 21 '22 15:10

Paul Hiemstra