
How should I add buffering to a gzip writer?

Tags: gzip, go

I noticed the gzip package uses bufio internally for reading gzipped files, but not for writing them. I know that buffering is important for I/O performance, so what is the proper way to buffer a gzip writer?

// ignoring error handling for this example
// (the three alternatives are mutually exclusive; pick one)
outFile, _ := os.Create("output.gz")

// Alternative 1 - bufio.Writer wraps gzip.Writer
gzipWriter := gzip.NewWriter(outFile)
writer := bufio.NewWriter(gzipWriter)

// Alternative 2 - gzip.Writer wraps bufio.Writer
writer := bufio.NewWriter(outFile)
gzipWriter := gzip.NewWriter(writer)

// Alternative 3 - replace bufio with bytes.Buffer
var buf bytes.Buffer
gzipWriter := gzip.NewWriter(&buf)

Also, do I need to Flush() the gzip writer or the bufio writer (or both) before closing it, or will closing it automatically flush the writer?

UPDATE: I now understand that both reads and writes are buffered with gzip, so buffering a gzip.Writer is really double buffering. @peterSO thinks this is redundant. @Stephen Weinberg thinks double buffering may reduce the number of syscalls, but suggests benchmarking to be sure.

ryboe asked Aug 06 '14 22:08



1 Answer

The proper way to use bufio is to wrap a writer that has a high fixed cost for each call to Write. That is the case for any writer whose writes turn into syscalls. Here, your "outFile" is an OS file, so each write to it is a syscall.

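// src is an io.Reader supplied by the caller; this snippet assumes the
// enclosing function returns an error.
outFile, err := os.Create("output.gz")
if err != nil {
    return err
}
defer outFile.Close() // runs last

buf := bufio.NewWriter(outFile)
defer buf.Flush() // runs second

gz := gzip.NewWriter(buf)
defer gz.Close() // runs first

// Errors from the deferred Close/Flush calls are ignored here.
_, err = io.Copy(gz, src)
return err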

In this case, we are grouping writes to outFile with bufio so as to avoid unnecessary syscalls. The order is src -> gzip -> buffer -> file.
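If you want to check the effect rather than take it on faith, here is a rough sketch that counts how many Write calls reach the destination with and without bufio in between. The countingWriter type and the countWrites helper are made up for this illustration (they are not part of any package), and the pseudo-random input is arbitrary. Counting Write calls is only a proxy for syscalls on a real file, so treat it as a starting point for a proper benchmark, not a measurement.

```go
package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
)

// countingWriter is a hypothetical stand-in for an os.File: it discards the
// data and counts Write calls, each of which would be one syscall on a file.
type countingWriter struct{ calls int }

func (c *countingWriter) Write(p []byte) (int, error) {
	c.calls++
	return len(p), nil
}

// countWrites gzips data twice and reports how many Write calls reached the
// destination without and with a bufio.Writer between gzip and the sink.
func countWrites(data []byte) (direct, buffered int, err error) {
	// src -> gzip -> sink
	var d countingWriter
	gz := gzip.NewWriter(&d)
	if _, err = gz.Write(data); err != nil {
		return 0, 0, err
	}
	if err = gz.Close(); err != nil {
		return 0, 0, err
	}

	// src -> gzip -> bufio -> sink
	var b countingWriter
	bw := bufio.NewWriter(&b)
	gz = gzip.NewWriter(bw)
	if _, err = gz.Write(data); err != nil {
		return 0, 0, err
	}
	if err = gz.Close(); err != nil {
		return 0, 0, err
	}
	if err = bw.Flush(); err != nil {
		return 0, 0, err
	}
	return d.calls, b.calls, nil
}

func main() {
	// Arbitrary, poorly compressible test data from a simple generator.
	data := make([]byte, 1<<20)
	x := uint32(1)
	for i := range data {
		x = x*1664525 + 1013904223
		data[i] = byte(x >> 24)
	}

	direct, buffered, err := countWrites(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("writes without bufio: %d, with bufio: %d\n", direct, buffered)
}
```

The exact numbers depend on the Go version and the input, so run it against data that looks like yours. bufio.NewWriterSize lets you try buffer sizes other than the default.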

Now, when we finish writing, we have several layers that need to be finished in order. First we need to tell gzip we are done (Close), so it can flush its own buffers and write its final information into the bufio.Writer. Closing the gzip.Writer does not close or flush the writer underneath it. Then we need to tell the bufio.Writer we are done (Flush, since it has no Close), so it can write out whatever it was holding back for the next batch write. Finally, we need to tell the OS we are done with the file.

This teardown happens in the opposite order of creation, so we can use defers to make it easier. Deferred calls run in reverse order on return. Because each defer sits right next to the call that created the thing it tears down, the flushes are guaranteed to happen in the proper order.

Stephen Weinberg answered Nov 11 '22 02:11
