 

How to speed up the extraction of a large tgz file with lots of small files? [closed]

I have a tar archive (17 GB) which consists of many small files (all files < 1 MB). How do I use this archive?

  1. Do I extract it? 7-Zip on my laptop says it will take 20 hours (and I think it will take even more).
  2. Can I read/browse the contents of the file without extracting it? If yes, then how?
  3. Is there any other option?

It is actually a processed Wikipedia dataset on which I am supposed to perform some Natural Language Processing.

Platform (Windows or Linux) is not an issue; anything will do, as long as it gets the job done as quickly as possible.



1 Answer

I suppose you have a Linux laptop or desktop, and that your hugearchive.tgz file is on some local disk (not a remote network filesystem, which could be too slow). If possible, put that hugearchive.tgz file on a fast disk (preferably an SSD, not a magnetic rotating hard disk) and on a fast Linux-native file system (Ext4, XFS, BTRFS; not FAT32 or NTFS).

Notice that a .tgz file is just a gzip-compressed .tar file.
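
So, for example, these are two equivalent ways of listing its contents (a trivial illustration; gunzip -c decompresses to a pipe and tar reads the resulting .tar stream from stdin):

    gunzip -c hugearchive.tgz | tar -tvf -    # same as: tar -tzvf hugearchive.tgz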

Next time you get a huge archive, consider asking for it in the afio archive format, which has the big advantage of compressing not-too-small files individually (or perhaps ask for some SQL dump - e.g. for PostgreSQL, SQLite or MariaDB - in compressed form).

First, you should make a list of the file names in that hugearchive.tgz gzipped tar archive and ask for the total count of bytes:

 tar -tzv --totals -f hugearchive.tgz > /tmp/hugearchive-list.txt

That command will run gunzip to uncompress the .tgz file through a pipe (so it won't consume a lot of disk space), write the table of contents into /tmp/hugearchive-list.txt, and print on your stderr something like

  Total bytes read: 340048000 (331MiB, 169MiB/s)

Of course the figures are fictitious; you'll get much bigger ones. But you'll know the total cumulated size of the archive, and you'll have its table of contents. Use wc -l /tmp/hugearchive-list.txt to get the number of lines in that table of contents, which is the number of files in the archive (unless some files are weirdly and maliciously named, e.g. with a newline in their filename, which is possible but unusual).
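
For example (a small sketch, assuming GNU tar's usual verbose listing, where the member size is the third column):

    wc -l /tmp/hugearchive-list.txt    # number of entries in the archive
    awk '{ sum += $3 } END { print sum, "bytes" }' /tmp/hugearchive-list.txt    # cumulated size of the members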

My guess is that you'll process your huge archive in less than one hour. Details depend on the computer, notably the hardware (if you can afford it, use an SSD, and get at least 8 GB of RAM).

Then you can decide whether you are able to extract all the files or not, since you know how much total space they need. Since you have the table of contents in /tmp/hugearchive-list.txt, you can easily extract only the useful files, if needed.
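
For instance, if you put the member names you want (one per line, exactly as they appear in the table of contents) into some file - say wanted-files.txt, a hypothetical name - GNU tar can extract just those entries:

    mkdir -p extracted
    tar -xzv -f hugearchive.tgz -C extracted -T wanted-files.txt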


For what it is worth, on my i3770K desktop with 16 GB RAM and both SSD and hard-disk storage, I made (for experimenting) a useless huge archive (made specifically for the purpose of answering this question, since I don't have your hugearchive.tgz file...) with

sudo time tar czf /tmp/hugefile.tgz /bin /usr/bin /usr/local/bin /var 

and it took this time to create that archive (with all these file systems on SSD):

 719.63s user 60.44s system 102% cpu 12:40.87 total

and the produced /tmp/hugefile.tgz is 5.4 gigabytes (notice that it probably sits in the page cache).

I then tried:

time tar -tzv --totals -f /tmp/hugefile.tgz > /tmp/hugefile-list.txt

and got:

Total bytes read: 116505825280 (109GiB, 277MiB/s)
tar -tzv --totals -f /tmp/hugefile.tgz > /tmp/hugefile-list.txt
    395.77s user 26.06s system 104% cpu 6:42.43 total

and the produced /tmp/hugefile-list.txt is 2.3 MB (for about 23,000 files), not a big deal.

Don't use z in your tar commands if your tar archive is not gzip-compressed.
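
You can check which compression (if any) was actually used with the file command, and pick the matching tar flag (z for gzip, j for bzip2, J for xz):

    file hugearchive.tgz    # e.g. "gzip compressed data" => use -z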

Read the documentation of tar(1) (and also of time(1) if you use it, and more generally of every command you are using!). Of course, use the command line (not some GUI interface), and learn some shell scripting.

BTW, you could later segregate the very small files (less than 64 KB) and e.g. put them inside some database (perhaps an SQLite, Redis, PostgreSQL or MongoDB database, filled with e.g. a small script) or maybe some GDBM indexed file; see the sketch below. Notice that most file systems have significant overhead for a large number of small files.
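
As a rough sketch of that idea (assuming the files were extracted under ./extracted/ and that the sqlite3 command-line shell is available; readfile() is a function of that shell, not standard SQL):

    # create a table mapping each file path to its content as a blob
    sqlite3 smallfiles.db 'CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, content BLOB);'
    # naive loop: assumes file names without quotes or newlines; a small script doing
    # batched inserts would be much faster for millions of files, but the idea is the same
    find extracted -type f -size -64k | while read -r f; do
        sqlite3 smallfiles.db "INSERT OR REPLACE INTO files VALUES ('$f', readfile('$f'));"
    done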

Learning shell scripting, some scripting language (Python, Lua, Guile, OCaml, Common Lisp), and basic database techniques is not a waste of time. If e.g. you are starting a PhD, it is almost a required skill set.

I don't know, don't use, and dislike Windows, so I am obviously biased (my first Linux was some Slackware with a 0.99.12 kernel, circa 1993 or early 1994), but I strongly recommend doing all your NLP work on Linux (and keeping Windows only for playing video games, when you have time for that), because scripting and combining many useful existing free software tools is so much easier on Linux.
