
Spark not using spark.sql.parquet.compression.codec

Tags:

apache-spark

I'm comparing Spark's Parquet files with Apache Drill's. Drill's Parquet files are much more lightweight than Spark's. Spark uses GZIP as the default compression codec, so for experimenting I tried to change it:

- snappy: same size
- uncompressed: same size
- lzo: exception

I tried both ways:

sqlContext.sql("SET spark.sql.parquet.compression.codec=uncompressed")
sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")

But it seems like it doesn't change the setting.
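
For reference, a minimal sketch of the kind of size comparison described above (Spark 1.6-era API; the SQLContext sqlContext, the DataFrame df and the output paths are assumptions for illustration):

// Write the same DataFrame once per codec and compare the on-disk sizes.
// `sqlContext` and `df` are assumed to exist already; paths are illustrative.
for (codec <- Seq("gzip", "snappy", "uncompressed")) {
  sqlContext.setConf("spark.sql.parquet.compression.codec", codec)
  df.write.mode("overwrite").parquet(s"/tmp/parquet_codec_test/$codec")
}
// Then compare the directory sizes, e.g. with `du -sh /tmp/parquet_codec_test/*`
// (or `hadoop fs -du -h ...` on HDFS).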

asked Mar 03 '16 by Federico Ponzi


People also ask

Is Parquet compressed by default?

By default Big SQL will use SNAPPY compression when writing into Parquet tables. This means that if data is loaded into Big SQL using either the LOAD HADOOP or INSERT… SELECT commands, then SNAPPY compression is enabled by default.

Can you use Spark SQL to read Parquet data?

Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons.
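
As a rough illustration (Spark 2.x; the session setup, the sample data and the paths are assumptions), writing a DataFrame to Parquet and reading it back preserves the schema:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-roundtrip").getOrCreate()
import spark.implicits._

// Write a small DataFrame to Parquet and read it back; the schema travels
// with the Parquet files themselves.
val people = Seq(("alice", 29), ("bob", 31)).toDF("name", "age")
people.write.mode("overwrite").parquet("/tmp/people_parquet")

val restored = spark.read.parquet("/tmp/people_parquet")
restored.printSchema()  // same columns and types, made nullable on read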

How do I compress a Parquet file?

First set the compression codec you want (the default is gzip). Then create a DataFrame, say df, from your data and save it with df.write.parquet("path_destination"). If you check the destination folder you will be able to see that the files have been stored with the compression type you specified.
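
A minimal sketch of those steps (Spark 2.x; the session name spark, the source path and the codec are illustrative assumptions):

// Choose the codec for Parquet writes (gzip being the default mentioned above).
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

// Create a DataFrame from your data and save it as Parquet.
val df = spark.read.json("/tmp/source.json")
df.write.parquet("path_destination")
// The part files in path_destination should now carry the codec in their
// names, e.g. part-00000-....snappy.parquet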

Does Parquet use compression?

Parquet is built to support flexible compression options and efficient encoding schemes. As the data type for each column is quite similar, the compression of each column is straightforward (which makes queries even faster).


1 Answer

Worked for me in Spark 2.1.1:

df.write.option("compression","snappy").parquet(filename)
answered Oct 21 '22 by ruseel
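
As a usage note on the answer above (assuming spark is the SparkSession and df/filename are as in the answer; Spark 2.x): the per-write compression option accepts values such as none, uncompressed, snappy, gzip and lzo, and it takes precedence over spark.sql.parquet.compression.codec for that write. A sketch of both variants:

// Per-write: overrides spark.sql.parquet.compression.codec for this write only.
df.write.option("compression", "gzip").parquet(filename + "_gzip")

// Session-wide: every subsequent Parquet write uses this codec unless the
// per-write option is given.
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
df.write.parquet(filename + "_snappy")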