Dataflow GZIP TextIO ZipException: too many length or distance symbols

Question

When using the TextIO.Read transform on a large collection of compressed text files (1000+ files, with sizes between 100MB and 1.5GB), we sometimes get the following error:

java.util.zip.ZipException: too many length or distance symbols
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at java.io.PushbackInputStream.read(PushbackInputStream.java:186)
    at com.google.cloud.dataflow.sdk.runners.worker.TextReader$ScanState.readBytes(TextReader.java:261)
    at com.google.cloud.dataflow.sdk.runners.worker.TextReader$TextFileIterator.readElement(TextReader.java:189)
    at com.google.cloud.dataflow.sdk.runners.worker.FileBasedReader$FileBasedIterator.computeNextElement(FileBasedReader.java:265)
    at com.google.cloud.dataflow.sdk.runners.worker.FileBasedReader$FileBasedIterator.hasNext(FileBasedReader.java:165)
    at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:169)
    at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.start(ReadOperation.java:118)
    at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:66)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:204)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:151)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:118)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:139)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:124)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Searching online for the same ZipException only led to this reply:

Zip file errors often happen when the hot deployer attempts to deploy an application before it is fully copied to the deploy directory. This is fairly common if it takes several seconds to copy the file. The solution is to copy the file to a temporary directory on the same disk partition as the application server, and then move the file to the deploy directory.

Has anybody else run into a similar exception? Is there any way we can fix this problem?

Ivan Tarasov · Accepted Answer

Looking at the code that produces the error message, it seems to be a problem with the zlib library (which is used by the JDK) not supporting the format of the gzip files that you have.

It looks to be the following bug in zlib: Codes for reserved symbols are rejected even if unused.

Unfortunately, there's probably little we can do to help other than suggest producing these compressed files using another utility.

If you can produce a small example gzip file that we could use to reproduce the issue, we might be able to see if it is possible to work around somehow, but I wouldn't rely on this to succeed.
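One way to narrow down which input files trigger the exception is to decompress each one end to end with the same JDK class that appears in the stack trace, java.util.zip.GZIPInputStream. The following is a minimal sketch, not part of the Dataflow SDK: the class name GzipCheck and the method isReadable are made up for illustration, and it assumes the files have first been copied to a local disk.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: reports which local gzip files the JDK cannot inflate.
public class GzipCheck {

    /** Returns true if the whole file decompresses without an I/O error. */
    static boolean isReadable(Path file) {
        byte[] buffer = new byte[64 * 1024];
        try (InputStream in = new GZIPInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            while (in.read(buffer) != -1) {
                // Discard the data; only the success of inflation matters here.
            }
            return true;
        } catch (IOException e) {
            // ZipException (bad data) and EOFException (truncated file)
            // are both subclasses of IOException.
            System.err.println(file + ": " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        // Each command-line argument is treated as a path to a local .gz file.
        for (String arg : args) {
            System.out.println((isReadable(Paths.get(arg)) ? "OK   " : "BAD  ") + arg);
        }
    }
}
```

A file reported as BAD fails in plain JDK code as well, so it is a candidate for a reproduction case. The check alone does not tell you whether the cause is the zlib limitation described above or a file that was damaged when it was written.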

AaronM · Answer

This question may be a bit old, but it was the first result in my Google search yesterday for this error:

HIVE_CURSOR_ERROR: too many length or distance symbols

After reading the tips here, I realized that I had botched the gzip construction of the files I was trying to process. I had two processes writing gzip'd data out to the same output file, and the output files were corrupt because of it. Fixing the processes to write to unique files resolved the issue. I thought this answer might save someone else some time.
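The fix described above, one output file per writer, can be sketched in Java as follows. This is only an illustration under assumptions: the class UniqueGzipWriter, the method writeShard, and the UUID-based naming scheme are hypothetical and are not taken from the original setup.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.UUID;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: every call writes its lines to a fresh gzip file.
public class UniqueGzipWriter {

    /** Writes the lines to a new, uniquely named .gz file in dir and returns its path. */
    static Path writeShard(Path dir, String prefix, List<String> lines) throws IOException {
        // A random UUID in the name keeps concurrent writers from sharing a file.
        Path target = dir.resolve(prefix + "-" + UUID.randomUUID() + ".gz");
        // CREATE_NEW fails instead of appending if the file somehow already exists.
        try (Writer writer = new OutputStreamWriter(
                new GZIPOutputStream(
                        Files.newOutputStream(target, StandardOpenOption.CREATE_NEW)),
                StandardCharsets.UTF_8)) {
            for (String line : lines) {
                writer.write(line);
                writer.write('\n');
            }
        }
        return target;
    }
}
```

Because each gzip stream is opened, written, and closed by exactly one writer, the bytes of two streams can no longer be interleaved in a single file.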

Tags:

java

gzipinputstream

google-cloud-dataflow

Fematich

