
How to create a gzipped tar file without using a lot of RAM?

Tags:

rust

I'm trying to create a gzipped tar file without it taking up a lot of RAM. The Bash equivalent of what I want to do is:

tar -cf - -C $INPUT . | gzip -cv - > $OUTPUT

I'm using the tar and flate2 libraries, which both say they support streaming. I cannot figure out how to stream one into the other. I have tried looking at the Write implementors, but do not see a type of stream that fits my needs.

My current implementation has the desired output (namely a .tar.gz file), but it uses up a lot of RAM, especially when the file size is large. The created file also gives "tar: Unexpected EOF in archive" when the input size is large, but is fine with small inputs. This indicates to me that it is not piping the streams as Bash would.

use flate2::write::GzEncoder;
use flate2::Compression;
use std::fs::File;
use tar::Builder;

// Create tar archive
let mut archive = Builder::new(Vec::new());
archive.append_dir_all("myfiles", "myfiles")?;

// Gzip tar archive and write to file
let compressed_file = File::create("backup.tar.gz")?;
let mut encoder = GzEncoder::new(compressed_file, Compression::Default);
encoder.write(&archive.into_inner()?)?;
encoder.finish()?;
Nicolas Chan asked Oct 02 '17 07:10



1 Answer

To understand why you are using so much RAM and why tar reports an error for large inputs, let's look at what exactly your code is doing:


let mut archive = Builder::new(Vec::new());

Looking at the Builder::new documentation, we can already see the main problem: "Create a new archive builder with the underlying object as the destination of all data written". Since you are passing a Vec (which implements Write), all of the tar data is written into that vector, and the vector lives in RAM. Note that tar only archives; it does not compress, so the vector grows to roughly the total size of the input files plus the tar headers.

archive.append_dir_all("myfiles", "myfiles")?;

This line already writes every file into the archive, and therefore into the vector, so this is where the RAM fills up.

Skipping a few lines:

encoder.write(&archive.into_inner()?)?;

Here you tell the encoder to write the vector you just filled. But it is important to remember that Write::write() makes no guarantee about how much data is written: it may accept only part of the buffer, and it returns the number of bytes it actually wrote. It is a low-level building block for more reliable higher-level functions. You want write_all() instead, which calls write() repeatedly until all data is written. Because you call write() only once and ignore its return value, only part of the data is written. A small archive can usually be written in one call, but with more data the bug becomes noticeable: the output is truncated, which is why tar reports "Unexpected EOF in archive".
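The difference between write() and write_all() can be reproduced with the standard library alone. The following is a minimal sketch, not flate2's actual behaviour: ChunkedWriter and the two helper functions are hypothetical names invented for illustration, and the writer deliberately accepts at most a few bytes per call to mimic a sink that takes only part of the buffer.

```rust
use std::io::{self, Write};

/// Hypothetical writer (not part of flate2 or tar) that accepts at most
/// `limit` bytes per `write()` call.
struct ChunkedWriter {
    data: Vec<u8>,
    limit: usize,
}

impl Write for ChunkedWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Accept only the first `limit` bytes and report how many were taken.
        let n = buf.len().min(self.limit);
        self.data.extend_from_slice(&buf[..n]);
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Number of bytes stored after a single `write()` call.
fn bytes_after_write(input: &[u8], limit: usize) -> io::Result<usize> {
    let mut w = ChunkedWriter { data: Vec::new(), limit };
    // The return value says how much was accepted; ignoring it loses the rest.
    let _accepted = w.write(input)?;
    Ok(w.data.len())
}

/// Number of bytes stored after `write_all()`, which loops until done.
fn bytes_after_write_all(input: &[u8], limit: usize) -> io::Result<usize> {
    let mut w = ChunkedWriter { data: Vec::new(), limit };
    w.write_all(input)?;
    Ok(w.data.len())
}
```

With a limit of 4, bytes_after_write(b"hello world", 4) stores only 4 of the 11 bytes, while bytes_after_write_all(b"hello world", 4) stores all 11.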


So what to do instead? Simple: Builder::new() expects something that implements Write and uses that as its destination. Your GzEncoder implements Write, too, so you can hand it to the builder directly and let the tar data flow straight into the compressor. Thus, this should work:

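// Create Gzip file
let compressed_file = File::create("backup.tar.gz")?;
// Note: with flate2 1.0 and later, write Compression::default() instead.
let mut encoder = GzEncoder::new(compressed_file, Compression::Default);

{
    // Create the tar archive, writing directly into the gzip encoder
    let mut archive = Builder::new(&mut encoder);
    archive.append_dir_all("myfiles", "myfiles")?;
    // Write the tar footer explicitly so that errors are not silently dropped
    archive.finish()?;
}

// Finish Gzip file
encoder.finish()?;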
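The reason this keeps memory use low can be sketched with standard-library types only. This is an illustration under stated assumptions, not the tar or flate2 implementation: PassThrough is a hypothetical stand-in for GzEncoder, produce is a hypothetical stand-in for Builder, and stream_blocks is a made-up helper. The sketch relies on the fact that &mut W implements Write whenever W does, which is what allows the encoder to be borrowed by the builder and finished afterwards.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for an encoder such as GzEncoder: forwards every
/// chunk to the inner writer immediately and records simple statistics.
struct PassThrough<W: Write> {
    inner: W,
    total: u64,
    largest_chunk: usize,
}

impl<W: Write> PassThrough<W> {
    fn new(inner: W) -> Self {
        PassThrough { inner, total: 0, largest_chunk: 0 }
    }

    /// Flushes and hands back the inner writer, similar in spirit to `finish()`.
    fn finish(self) -> io::Result<W> {
        let mut inner = self.inner;
        inner.flush()?;
        Ok(inner)
    }
}

impl<W: Write> Write for PassThrough<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = self.inner.write(buf)?;
        self.total += n as u64;
        self.largest_chunk = self.largest_chunk.max(n);
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}

/// Hypothetical stand-in for tar::Builder: emits `blocks` blocks of 512 bytes.
fn produce<W: Write>(mut dest: W, blocks: usize) -> io::Result<()> {
    let block = [0u8; 512];
    for _ in 0..blocks {
        dest.write_all(&block)?;
    }
    Ok(())
}

/// Streams the blocks through the pass-through stage into a sink and returns
/// (total bytes seen, largest single chunk seen).
fn stream_blocks(blocks: usize) -> io::Result<(u64, usize)> {
    let mut encoder = PassThrough::new(io::sink());
    {
        // Borrow the encoder mutably; `&mut W` implements `Write` as well.
        produce(&mut encoder, blocks)?;
    }
    let stats = (encoder.total, encoder.largest_chunk);
    encoder.finish()?;
    Ok(stats)
}
```

Calling stream_blocks(1000) pushes 512,000 bytes through the chain, yet no stage ever holds more than one 512-byte block at a time, which is the property the Vec-based version lacks.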
Lukas Kalbertodt answered Oct 23 '22 15:10
