I have a tar archive (17 GB) which consists of many small files (each under 1 MB). How do I use this archive?
It is actually a processed Wikipedia dataset on which I am supposed to perform some Natural Language Processing.
The platform (Windows or Linux) is not an issue; anything will do, as long as it gets the job done as quickly as possible.
I suppose you have a Linux laptop or desktop on which your hugearchive.tgz file is on some local disk (not a remote network filesystem, which could be too slow). If possible, put that hugearchive.tgz file on some fast disk (preferably an SSD, not a magnetic rotating hard disk) and a fast Linux-native file system (Ext4, XFS, BTRFS; not FAT32 or NTFS).
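To check which device and file system the archive actually sits on, something like the following is enough (a minimal check, assuming the archive is in the current directory; the Type column shows the file system):
df -Th hugearchive.tgz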
Notice that a .tgz file is a gzip-compressed (GNU zip) .tar file.
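If you are unsure what you actually received, the file utility will tell you (again assuming the archive is in the current directory); for a .tgz it should report something like "gzip compressed data":
file hugearchive.tgz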
Next time you get a huge archive, consider asking for it in the afio archive format, which has the big advantage of compressing not-too-small files individually (or perhaps ask for some SQL dump - e.g. for PostgreSQL or SQLite or MariaDB - in compressed form).
First, you should make a list of the file names in that hugearchive.tgz gzipped tar archive and ask for the total count of bytes:
tar -tzv --totals -f hugearchive.tgz > /tmp/hugearchive-list.txt
That command will run gunzip to uncompress the .tgz file through a pipe (so it won't consume a lot of disk space), write the table of contents into /tmp/hugearchive-list.txt, and print on your stderr something like
Total bytes read: 340048000 (331MiB, 169MiB/s)
Of course these figures are fictitious; you'll get much bigger ones. But you'll know the total cumulative size of the archive, and you'll have its table of contents. Use wc -l /tmp/hugearchive-list.txt to get the number of lines in that table of contents, that is the number of files in the archive, unless some files are weirdly and maliciously named (e.g. with a newline in their filename, which is possible but weird).
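Since the verbose listing contains the size of each entry, you can also sum the sizes yourself. The following is a rough sketch assuming GNU tar's usual -tv output, where the size is the third column (filenames containing newlines would confuse it):
awk '{ sum += $3 } END { printf "%.1f MiB\n", sum/(1024*1024) }' /tmp/hugearchive-list.txt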
My guess is that you'll process your huge archive in less than one hour. Details depend on the computer, notably the hardware (if you can afford it, use some SSD, and get at least 8Gbytes of RAM).
Then you can decide whether you are able to extract all the files or not, since you know how much total space they need. Since you have the table of contents in /tmp/hugearchive-list.txt, you can easily extract only the useful files, if so needed.
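For instance, here is a minimal sketch for extracting only a subset (the pattern Wikipedia/AA/ is purely hypothetical - use whatever matches the entries you need - and the awk field extraction breaks on filenames containing spaces):
grep 'Wikipedia/AA/' /tmp/hugearchive-list.txt | awk '{ print $NF }' > /tmp/wanted-files.txt   # hypothetical pattern
tar -xzf hugearchive.tgz -T /tmp/wanted-files.txt   # extract only the listed members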
For what it is worth, on my i3770K desktop with 16 GB of RAM and both SSD and hard-disk storage, I made (for experimenting) a useless huge archive (made specifically for the purpose of answering this question, since I don't have your hugearchive.tgz file ...) with
sudo time tar czf /tmp/hugefile.tgz /bin /usr/bin /usr/local/bin /var
and it took this time to create that archive (with all these file systems on SSD):
719.63s user 60.44s system 102% cpu 12:40.87 total
and the produced /tmp/hugefile.tgz is 5.4 gigabytes (notice that it probably sits in the page cache).
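If you want timings that are not flattered by the page cache, you can drop the kernel caches before re-running a test (this needs root and is only worth doing for benchmarking, since it temporarily slows everything down):
sync                                          # flush dirty pages to disk first
echo 3 | sudo tee /proc/sys/vm/drop_caches    # drop page cache, dentries and inodes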
I then tried:
time tar -tzv --totals -f /tmp/hugefile.tgz > /tmp/hugefile-list.txt
and got:
Total bytes read: 116505825280 (109GiB, 277MiB/s)
tar -tzv --totals -f /tmp/hugefile.tgz > /tmp/hugefile-list.txt
395.77s user 26.06s system 104% cpu 6:42.43 total
and the produced /tmp/hugefile-list.txt is 2.3 MB (for about 23,000 files), not a big deal.
Don't use z in your tar commands if your tar archive is not gzip-compressed.
Read the documentation of tar(1) (and also of time(1) if you use it, and more generally of every command you are using!). Of course, use the command line (not some GUI), and learn some shell scripting.
BTW, you could later segregate very small files (less than 64 KB) and e.g. put them inside some database (perhaps an SQLite, Redis, PostgreSQL or MongoDB database, filled with e.g. a small script) or maybe some GDBM indexed file. Notice that most file systems have significant overhead for a large number of small files.
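As a minimal sketch of that idea using the sqlite3 command-line shell: the database name files.db, the table docs and the extracted/ directory are placeholders of mine, spawning one sqlite3 process per file is slow but keeps the example simple, and the naive quoting breaks on filenames containing single quotes:
sqlite3 files.db 'CREATE TABLE IF NOT EXISTS docs (path TEXT PRIMARY KEY, content BLOB);'
find extracted/ -type f -size -64k -print0 |
  while IFS= read -r -d '' f; do
    # readfile() is provided by the sqlite3 shell, not by SQL itself
    sqlite3 files.db "INSERT OR REPLACE INTO docs VALUES ('$f', readfile('$f'));"
  done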
Learning shell scripting, some scripting language (Python, Lua, Guile, OCaml, Common Lisp), and basic database techniques is not a waste of time. If e.g. you are starting a PhD, it is almost a required skill set.
I don't know and don't use (and dislike) Windows, so I am obviously biased (my first Linux was some Slackware with a 0.99.12 kernel circa 1993 or early 1994), but I strongly recommend that you do all your NLP work on Linux (and keep Windows only for playing video games, when you have time for that), because scripting and combining many useful existing free software tools is so much easier on Linux.