 

Extract a file from tar.gz, without touching disk

Tags: bash, tar

Current Process:

  1. I have a tar.gz file. (Actually, I have about 2000 of them, but that's another story).
  2. I make a temporary directory, extract the tar.gz file, revealing 100,000 tiny files (around 600 bytes each).
  3. For each file, I cat it into a processing program, pipe the output of that whole loop into another analysis program, and save the result.

The temporary space on the machines I'm using can barely handle one of these processes at once, never mind the 16 (hyperthreaded dual quad-core) that they get sent by default. I'm looking for a way to do this without saving anything to disk. I believe the performance penalty of pulling files out individually with tar -xf $file -O <targetname> would be prohibitive, but it might be what I'm stuck with.
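For concreteness, the per-file approach I mean would look roughly like this (archive and program names are placeholders); each iteration re-reads and decompresses the whole archive from the start, which is why I expect it to be far too slow for 100,000 members:

tar -tzf archive.tar.gz | while read -r member; do
    tar -xzOf archive.tar.gz "$member" | ./processing_program
done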

Is there any way of doing this?

EDIT: Since two people have already made this mistake, I'm going to clarify:

  • Each file represents one point in time.
  • Each file is processed separately.
  • Once processed (in this case a variant on Fourier analysis), each gives one line of output.
  • This output can be combined to do things like autocorrelation across time.

EDIT2: Actual code:

for f in posns/*; do
    ~/data_analysis/intermediate_scattering_function < "$f"
done | ~/data_analysis/complex_autocorrelation.awk limit=1000 > inter_autocorr.txt
asked Jun 18 '12 by zebediah49


3 Answers

If you do not care about the boundaries between files, then tar --to-stdout -xf $file will do what you want; it will send the contents of each file in the archive to stdout one after the other.

This assumes you are using GNU tar, which is reasonably likely if you are using bash.
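Applied to a gzipped archive, a minimal sketch (archive and program names are placeholders) would be:

tar --to-stdout -xzf archive.tar.gz | ./analysis_program

--to-stdout is the long form of -O, so tar -xzOf archive.tar.gz is equivalent.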

[Update]

Given the constraint that you do want to process each file separately, I agree with Charles Duffy that a shell script is the wrong tool.

You could try his Python suggestion, or you could try the Archive::Tar Perl module. Either of these would allow you to iterate through the tar file's contents in memory.

answered Sep 27 '22 by Nemo


This sounds like a case where the right tool for the job is probably not a shell script. Python has a tarfile module which can operate in streaming mode, letting you make only a single pass through the large archive and process its files, while still being able to distinguish the individual files (which the tar --to-stdout approach will not).

answered Sep 27 '22 by Charles Duffy


You can use the tar option --to-command=cmd to execute a command for each extracted file. Tar redirects the file's contents to the command's standard input and sets environment variables with details about the file, such as TAR_FILENAME. See the GNU Tar documentation for more details.

e.g.

tar zxf file.tar.gz --to-command='./process.sh'

Note that OSX uses bsdtar by default, which does not have this option. You can explicitly call gnutar instead.
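As a rough sketch (untested; the script name follows the example above and the downstream programs are placeholders), process.sh is run once per regular member, with the member's bytes on its standard input:

#!/bin/bash
# invoked by tar once per archive member: the member's contents arrive on stdin,
# and its name is exported in $TAR_FILENAME (TAR_SIZE and others are also set)
echo "processing $TAR_FILENAME" >&2
./processing_program    # placeholder: reads a single member from stdin

Assuming the script's standard output is simply inherited from tar rather than captured, the per-member results can then be piped onward, e.g. tar zxf file.tar.gz --to-command='./process.sh' | ./combine_results.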

answered Sep 27 '22 by McK