
Parallel zip compression in Go

Tags: zip, go

I am trying to build a zip archive from a large number of small to medium-sized files. I want to do this concurrently, since compression is CPU-intensive and I'm running on a multi-core server. I also don't want to hold the whole archive in memory, since it might turn out to be large.

My question is: do I have to compress every file separately and then manually combine everything together with the zip header, checksums, etc.?

Any help would be greatly appreciated.

Asked Apr 11 '14 by Akshay Rawat

1 Answer

I don't think you can combine the zip headers.

What you could do is run the zip.Writer sequentially in a separate goroutine, spawn a new goroutine for each file you want to read, and pipe those files to the goroutine that is zipping them.

This should reduce the IO overhead you would get from reading the files sequentially, although it probably won't leverage multiple cores for the archiving itself.

Here's a working example. Note that, to keep things simple,

  • it does not handle errors nicely, and just panics if something goes wrong,
  • and it does not use the defer statement much, in order to demonstrate the order in which things should happen.

Since defer is LIFO, it can sometimes be confusing when you stack a lot of them together.
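
As a quick aside, this tiny standalone snippet demonstrates that order:

package main

import "fmt"

func main() {
    // Deferred calls run after the function body finishes, in
    // reverse order of registration (LIFO).
    defer fmt.Println("deferred first, runs last")
    defer fmt.Println("deferred second, runs first")
    fmt.Println("function body runs before any deferred call")
}

It prints the body line first, then the deferred calls in reverse order of registration. With that in mind, here is the full example: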

package main

import (
    "archive/zip"
    "io"
    "os"
    "sync"
)

func ZipWriter(files chan *os.File) *sync.WaitGroup {
    f, err := os.Create("out.zip")
    if err != nil {
        panic(err)
    }
    var wg sync.WaitGroup
    wg.Add(1)
    zw := zip.NewWriter(f)
    go func() {
        // Note the order (LIFO):
        defer wg.Done() // 2. signal that we're done
        defer f.Close() // 1. close the file
        var err error
        var fw io.Writer
        for f := range files {
            // Loop until the channel is closed.
            // Note: f.Name() is the name that was passed to os.Open,
            // so entries keep whatever path the caller supplied.
            if fw, err = zw.Create(f.Name()); err != nil {
                panic(err)
            }
            // Check the copy error too; silently ignoring it would
            // contradict the "panic on any error" approach above.
            if _, err = io.Copy(fw, f); err != nil {
                panic(err)
            }
            if err = f.Close(); err != nil {
                panic(err)
            }
        }
        // The zip writer must be closed *before* f.Close() is called!
        if err = zw.Close(); err != nil {
            panic(err)
        }
    }()
    return &wg
}

func main() {
    files := make(chan *os.File)
    wait := ZipWriter(files)

    // Send all files to the zip writer.
    var wg sync.WaitGroup
    wg.Add(len(os.Args)-1)
    // os.Args[0] is the program name, so skip it.
    for _, name := range os.Args[1:] {
        // Read each file in parallel:
        go func(name string) {
            defer wg.Done()
            f, err := os.Open(name)
            if err != nil {
                panic(err)
            }
            files <- f
        }(name)
    }

    wg.Wait()
    // Once we're done sending the files, we can close the channel.
    close(files)
    // This causes ZipWriter to break out of its loop, close the zip
    // writer and the output file, and release the final wait below:
    wait.Wait()
}

Usage: go run example.go /path/to/*.log.

This is the order in which things should be happening:

  1. Open output file for writing.
  2. Create a zip.Writer with that file.
  3. Kick off a goroutine listening for files on a channel.
  4. Go through each file; this can be done with one goroutine per file.
  5. Send each file to the goroutine created in step 3.
  6. After processing each file in said goroutine, close the file to free up resources.
  7. Once each file has been sent to said goroutine, close the channel.
  8. Wait until the zipping has been done (which is done sequentially).
  9. Once zipping is done (channel exhausted), the zip writer should be closed.
  10. Only when the zip writer is closed, should the output file be closed.
  11. Finally, once everything is closed, signal the sync.WaitGroup to tell the calling function that we're good to go. (A channel could also be used here, as sketched after this list, but sync.WaitGroup seems more elegant.)
  12. When you get the signal from the zip writer that everything is properly closed, you can exit from main and terminate nicely.
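
For reference, here is a minimal, self-contained sketch of the channel-based signaling that step 11 alludes to. It uses a generic worker rather than the zip code itself, and startWorker is a made-up name for illustration:

package main

import "fmt"

// startWorker drains the jobs channel and returns a channel that is
// closed once the worker is done, mirroring what ZipWriter's
// WaitGroup does in the example above.
func startWorker(jobs <-chan int) <-chan struct{} {
    done := make(chan struct{})
    go func() {
        defer close(done) // closing the channel signals "done"
        for j := range jobs {
            fmt.Println("processed", j)
        }
    }()
    return done
}

func main() {
    jobs := make(chan int)
    done := startWorker(jobs)
    for i := 1; i <= 3; i++ {
        jobs <- i
    }
    close(jobs) // no more jobs; the worker's range loop will exit
    <-done      // block until the worker signals completion
}

Closing the done channel broadcasts completion to any number of waiters, which is why it can stand in for the sync.WaitGroup here.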

This might not answer your question directly, but I used similar code to generate zip archives on the fly for a web service some time ago. It performed quite well, even though the actual zipping was done in a single goroutine. Overcoming the IO bottleneck can already be an improvement.
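
For the curious, here is a minimal sketch of what such on-the-fly streaming can look like over HTTP. The handler, file list, and port are hypothetical, not taken from that service:

package main

import (
    "archive/zip"
    "io"
    "net/http"
    "os"
)

// zipHandler streams a zip archive straight to the client, so the
// whole archive is never buffered in memory. The file list is
// hardcoded purely for illustration.
func zipHandler(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "application/zip")
    w.Header().Set("Content-Disposition", `attachment; filename="logs.zip"`)

    zw := zip.NewWriter(w) // http.ResponseWriter is an io.Writer
    defer zw.Close()

    for _, name := range []string{"a.log", "b.log"} {
        f, err := os.Open(name)
        if err != nil {
            return // headers are already sent; just stop streaming
        }
        fw, err := zw.Create(name)
        if err != nil {
            f.Close()
            return
        }
        io.Copy(fw, f)
        f.Close()
    }
}

func main() {
    http.HandleFunc("/archive", zipHandler)
    http.ListenAndServe(":8080", nil)
}

Because zip.NewWriter writes straight to the http.ResponseWriter, the archive as a whole is never held in memory, which matches the constraint in the question.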

Answered Nov 14 '22 by Attila O.