I have a large.tar.gz file containing about 1 million files, roughly a quarter of which are HTML files, and I want to parse a few lines of each of those HTML files. I want to avoid extracting the contents of large.tar.gz into a folder and then parsing the HTML files there; instead, I would like to pipe the contents of the HTML files inside large.tar.gz straight to STDOUT so that I can grep/parse out the information I want from them.
I presume there must be some magic like:
tar -special_flags large.tar.gz | grep_only_files_with_extension html | xargs -n1 head -n 99999 | ./parse_contents.pl -
Any ideas?
To write the extracted files to the standard output, instead of creating the files on the file system, use '--to-stdout' ('-O') in conjunction with '--extract' ('--get', '-x'). This option is useful if you are extracting files to send them through a pipe, and do not need to preserve them in the file system.
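For instance, a minimal invocation matching that description, using a hypothetical member name index.html (modern GNU tar auto-detects the gzip compression, but --gzip makes it explicit):

# Write one archive member to stdout instead of creating it on disk;
# 'index.html' is a placeholder member name, not from the question
tar --extract --gzip --to-stdout --file large.tar.gz index.html | head -n 5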
Use this with GNU tar to extract a tgz to stdout:
tar -xOzf large.tar.gz --wildcards '*.html' | grep ...
-O, --to-stdout
    extract files to standard output
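Note that -O concatenates every matching file into one continuous stream, with nothing marking where one file ends and the next begins. If parse_contents.pl needs per-file boundaries, GNU tar's --to-command runs a command once for each extracted member, with that member's contents on its stdin (the member name is exported as TAR_FILENAME). A sketch reusing the parser from the question:

# Invoke the parser once per HTML file; each invocation reads exactly
# one file's contents on stdin, and can consult $TAR_FILENAME if needed
tar -xzf large.tar.gz --wildcards '*.html' --to-command='./parse_contents.pl -'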