Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Read Zstandard-compressed file in Spark 2.3.0

Apache Spark supposedly supports Facebook's Zstandard compression algorithm as of Spark 2.3.0 (https://issues.apache.org/jira/browse/SPARK-19112), but I am unable to actually read a Zstandard-compressed file:

$ spark-shell

...

// Short name throws an exception
scala> val events = spark.read.option("compression", "zstd").json("data.zst")
java.lang.IllegalArgumentException: Codec [zstd] is not available. Known codecs are bzip2, deflate, uncompressed, lz4, gzip, snappy, none.

// Codec class can be imported
scala> import org.apache.spark.io.ZStdCompressionCodec
import org.apache.spark.io.ZStdCompressionCodec

// Fully-qualified code class bypasses error, but results in corrupt records
scala> spark.read.option("compression", "org.apache.spark.io.ZStdCompressionCodec").json("data.zst")
res4: org.apache.spark.sql.DataFrame = [_corrupt_record: string]

What do I need to do in order to read such a file?

Environment is AWS EMR 5.14.0.

like image 781
Josh Avatar asked Jun 15 '18 02:06

Josh


People also ask

Can Spark read compressed files?

Spark supports all compression formats that are supported by Hadoop.

Whats new in Spark 2. 3?

This release adds support for Continuous Processing in Structured Streaming along with a brand new Kubernetes Scheduler backend. Other major updates include the new DataSource and Structured Streaming v2 APIs, and a number of PySpark performance enhancements.

What is Apache spark framework?

Spark is an open source framework focused on interactive query, machine learning, and real-time workloads. It does not have its own storage system, but runs analytics on other storage systems like HDFS, or other popular stores like Amazon Redshift, Amazon S3, Couchbase, Cassandra, and others.


1 Answers

Per this comment, support for Zstandard in Spark 2.3.0 is limited to internal and shuffle outputs.

Reading or writing Zstandard files utilizes Hadoop's org.apache.hadoop.io.compress.ZStandardCodec, which was introduced in Hadoop 2.9.0 (2.8.3 is included in EMR 5.14.0).

like image 92
Josh Avatar answered Sep 22 '22 06:09

Josh