I have a large.tar.gz file containing about 1 million files, roughly a quarter of which are HTML files, and I want to parse a few lines of each of those HTML files. I want to avoid extracting the contents of large.tar.gz into a folder and then parsing the HTML files there; instead, I would like to pipe the contents of the HTML files inside large.tar.gz straight to STDOUT so that I can grep/parse out the information I want from them.
I presume there must be some magic like:
tar -special_flags large.tar.gz | grep_only_files_with_extension html | xargs -n1 head -n 99999 | ./parse_contents.pl -
Any ideas?
To write the extracted files to the standard output, instead of creating the files on the file system, use '--to-stdout' ('-O') in conjunction with '--extract' ('--get', '-x'). This option is useful if you are extracting files to send them through a pipe, and do not need to preserve them in the file system.
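For instance, a minimal invocation matching that description, using a hypothetical member name index.html (modern GNU tar auto-detects the gzip compression, but --gzip makes it explicit):

# Write one archive member to stdout instead of creating it on disk;
# 'index.html' is a placeholder member name, not from the question
tar --extract --gzip --to-stdout --file large.tar.gz index.html | head -n 5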
Use this with GNU tar to extract a tgz to stdout:
tar -xOzf large.tar.gz --wildcards '*.html' | grep ...
-O, --to-stdout
    extract files to standard output
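Note that -O concatenates every matching file into one continuous stream, with nothing marking where one file ends and the next begins. If parse_contents.pl needs per-file boundaries, GNU tar's --to-command runs a command once for each extracted member, with that member's contents on its stdin (the member name is exported as TAR_FILENAME). A sketch reusing the parser from the question:

# Invoke the parser once per HTML file; each invocation reads exactly
# one file's contents on stdin, and can consult $TAR_FILENAME if needed
tar -xzf large.tar.gz --wildcards '*.html' --to-command='./parse_contents.pl -'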