
How much space is needed to download entire CRAN repository?

Tags: r, cran

How much space is needed to download the entire CRAN Repository? Keeping all the files zipped, how large would a folder holding all the packages be? I can't find a clear answer to this question. I've read about 3GB, but I've also come across 200GB.

Easthaven asked Sep 22 '16 22:09


1 Answer

Per my comment:

# Full copies: macOS binaries for R 3.2 and 3.3, docs, macOS tools, web pages, all sources.
rsync -rtlzv --delete cran.r-project.org::CRAN/bin/macosx/mavericks/contrib/3.2/ /cran/bin/macosx/mavericks/contrib/3.2/
rsync -rtlzv --delete cran.r-project.org::CRAN/bin/macosx/mavericks/contrib/3.3/ /cran/bin/macosx/mavericks/contrib/3.3/
rsync -rtlzv --delete cran.r-project.org::CRAN/doc/ /cran/doc/
rsync -rtlzv --delete cran.r-project.org::CRAN/bin/macosx/tools/ /cran/bin/macosx/tools/
rsync -rtlzv --delete cran.r-project.org::CRAN/web/ /cran/web/
rsync -rtlzv --delete cran.r-project.org::CRAN/src/ /cran/src/
# Filtered copies: --exclude="*" also excludes subdirectories, so only files
# sitting directly in the source directory and matching an --include are copied.
rsync -tlzv --delete -a --include="NEWS" --include="*.shtml" --include="*.html" --include="*.pkg" --include="*.dmg" --include="*.gz" --exclude="*" cran.r-project.org::CRAN/bin/macosx/ /cran/bin/macosx/
rsync -tlzv --delete -a --include="*.html" --include="*.shtml" --include="*.svg" --include="*.png" --exclude="*" cran.r-project.org::CRAN/ /cran/
# Package index; already covered by the src/ sync above, refreshed here on its own.
rsync -rtlzv --delete cran.r-project.org::CRAN/src/contrib/PACKAGES.gz /cran/src/contrib/PACKAGES.gz

(which is not an optimized set of rsync statements) gets me a fully functional local CRAN repo that supports all of my systems quite well. I let the sole, nigh-useless Windows VM I keep for testing use RStudio's mirror, since I have no use for its cruft on this system, but my Linux and macOS systems work flawlessly with this when it comes to pkgs.
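Since the statements above are admittedly not optimized, here is one way the six whole-directory syncs could be folded into a loop. This is only a sketch: the names REMOTE, DEST, RSYNC and sync_dirs are made up for illustration, and the two filtered syncs with --include/--exclude are left out because their options differ.

```shell
# Sketch only. REMOTE and DEST default to the values used in the statements above.
# Set RSYNC="echo rsync" to print the commands instead of running them.
REMOTE="${REMOTE:-cran.r-project.org::CRAN}"
DEST="${DEST:-/cran}"
RSYNC="${RSYNC:-rsync}"

# Runs the same "rsync -rtlzv --delete" call once per fully mirrored directory.
sync_dirs() {
  for d in \
    bin/macosx/mavericks/contrib/3.2 \
    bin/macosx/mavericks/contrib/3.3 \
    doc \
    bin/macosx/tools \
    web \
    src
  do
    mkdir -p "$DEST/$d"   # make sure the target directory exists
    $RSYNC -rtlzv --delete "$REMOTE/$d/" "$DEST/$d/"
  done
}
```

Calling sync_dirs with RSYNC="echo rsync" and DEST pointed at a scratch directory shows the six commands without touching the network.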

As I said in the comment, this is under 60GB.
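To check a figure like this before downloading anything, rsync can be run as a dry run with --stats, which reports the total size of the files on the remote side. The helper below is a hypothetical sketch (total_bytes is not part of rsync) that just pulls the byte count out of that report; it assumes rsync's default number format, not the -h output.

```shell
# Sketch only: extract the byte count from the "Total file size" line that
# `rsync --stats` prints. Keeps digits only, so "1,234,567 bytes" becomes 1234567.
total_bytes() {
  awk -F': ' '/^Total file size/ { gsub(/[^0-9]/, "", $2); print $2 }'
}

# Dry run (-n): nothing is transferred, only the statistics are printed.
# Example (needs network access; /tmp/cran-empty/ is an arbitrary empty target):
#   rsync -rtln --stats cran.r-project.org::CRAN/src/ /tmp/cran-empty/ | total_bytes
#
# For a mirror that is already on disk, du reports the size directly:
#   du -sh /cran
```

Running the dry run once per directory in the list above and adding up the results gives an estimate for that particular selection.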

To make it fully functional, you have to set up a web server, and it's a PITA to use anything but Apache given the 1990s web tech setup CRAN seems determined to maintain. Said config is an exercise left to the reader.
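For readers who want a starting point for that exercise, here is a rough sketch of a shell helper that prints a minimal Apache 2.4 virtual host for the mirror directory. Nothing here comes from the answer: the function name apache_vhost and the host name are invented, and the server-side-include lines are an assumption that the mirrored .shtml pages need mod_include to render.

```shell
# Sketch only: print a minimal Apache 2.4 vhost for a local mirror.
#   $1 = directory holding the mirror (e.g. /cran)
#   $2 = host name clients will use (hypothetical, pick your own)
apache_vhost() {
  root="$1"
  name="$2"
  cat <<EOF
<VirtualHost *:80>
    ServerName $name
    DocumentRoot "$root"
    <Directory "$root">
        # Indexes: directory listings; Includes: assumed to be needed for .shtml
        Options Indexes FollowSymLinks Includes
        AddType text/html .shtml
        AddOutputFilter INCLUDES .shtml
        Require all granted
    </Directory>
</VirtualHost>
EOF
}
```

For example, apache_vhost /cran cran.example.internal prints a block that could be saved into Apache's site configuration directory and adjusted from there.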

Of note: it's worth the time doing the mirror and exploring the nuggets around the filesystem. There are many RDS files for "accounting" and other insights you won't get from staring at the 1990s HTML files on the web site.

Using your own local mirror reduces information leakage, stops you from contributing to the (IMO very inaccurate) "# downloads" package counts that show up on GitHub README.md badges, and keeps your usage private from mirrors that keep logs or mine your pkg usage.
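To actually use the local mirror, each R installation has to be pointed at it. One common way is the repos option in an R profile; the sketch below appends that line from the shell. The function name set_cran_repo and the URL are placeholders, not something from the setup described above.

```shell
# Sketch only: append a repos setting to an R profile so install.packages()
# uses the given mirror by default.
#   $1 = URL of the mirror (placeholder below)
#   $2 = profile file to append to (defaults to ~/.Rprofile)
set_cran_repo() {
  url="$1"
  profile="${2:-$HOME/.Rprofile}"
  printf 'options(repos = c(CRAN = "%s"))\n' "$url" >> "$profile"
}

# Example:
#   set_cran_repo "http://cran.example.internal"
```

New R sessions started after this change pick up the setting.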

hrbrmstr answered Sep 20 '22 03:09