Apache Spark supposedly supports Facebook's Zstandard compression algorithm as of Spark 2.3.0 (https://issues.apache.org/jira/browse/SPARK-19112), but I am unable to actually read a Zstandard-compressed file:
$ spark-shell
...
// Short name throws an exception
scala> val events = spark.read.option("compression", "zstd").json("data.zst")
java.lang.IllegalArgumentException: Codec [zstd] is not available. Known codecs are bzip2, deflate, uncompressed, lz4, gzip, snappy, none.
// Codec class can be imported
scala> import org.apache.spark.io.ZStdCompressionCodec
import org.apache.spark.io.ZStdCompressionCodec
// Fully-qualified codec class bypasses the error, but results in corrupt records
scala> spark.read.option("compression", "org.apache.spark.io.ZStdCompressionCodec").json("data.zst")
res4: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
What do I need to do in order to read such a file?
Environment is AWS EMR 5.14.0.
For reading and writing files, Spark supports the compression formats that the underlying Hadoop libraries support.
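Because codec resolution for file sources is delegated to Hadoop, a quick sanity check from spark-shell (a sketch; data.zst is the file from the question) is to ask Hadoop's CompressionCodecFactory whether anything on the cluster claims the .zst extension:

// Check which codec, if any, Hadoop resolves for a .zst file
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.CompressionCodecFactory

val factory = new CompressionCodecFactory(new Configuration())
// getCodec returns null when no registered codec matches the extension
// (e.g. on Hadoop releases older than 2.9.0, which lack ZStandardCodec)
val codec = factory.getCodec(new Path("data.zst"))
println(Option(codec).map(_.getClass.getName).getOrElse("no codec registered for .zst"))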
Per this comment, Zstandard support in Spark 2.3.0 is limited to internal and shuffle outputs; it does not extend to reading or writing data files.
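So in 2.3.0 you can, for example, switch the internal/shuffle compression to zstd via configuration (a sketch; this does not help with reading .zst input files):

$ spark-shell --conf spark.io.compression.codec=zstd

// Or in a standalone application, the same setting via the builder:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("zstd-shuffle")
  .config("spark.io.compression.codec", "zstd")  // Spark's internal ZStdCompressionCodec
  .getOrCreate()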
Reading or writing Zstandard files relies on Hadoop's org.apache.hadoop.io.compress.ZStandardCodec, which was introduced in Hadoop 2.9.0. EMR 5.14.0 ships Hadoop 2.8.3, so that codec is not available there.
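Until the cluster runs Hadoop 2.9.0 or later (with ZStandardCodec on the classpath), one workaround is to decompress the file outside of Spark and read the plain JSON. The zstd CLI invocation and the data.json name below are assumptions for illustration:

$ zstd -d data.zst -o data.json   # requires the zstd command-line tool
$ spark-shell
scala> val events = spark.read.json("data.json")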