I'm comparing Spark's Parquet files with Apache Drill's. Drill's Parquet files are far more lightweight than Spark's. Spark uses GZIP as its default compression codec, so to experiment I tried changing it: snappy (same size), uncompressed (same size), lzo (exception).
I tried both ways:
sqlContext.sql("SET spark.sql.parquet.compression.codec=uncompressed")
sqlContext.setConf("spark.sql.parquet.compression.codec.", "uncompressed")
But it doesn't seem to change the setting.
By default Big SQL will use SNAPPY compression when writing into Parquet tables. This means that if data is loaded into Big SQL using either the LOAD HADOOP or INSERT… SELECT commands, then SNAPPY compression is enabled by default.
Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons.
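For example, a minimal round-trip sketch (a Spark 2.x SparkSession and the /tmp paths here are assumptions, not from the original post):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-roundtrip").master("local[*]").getOrCreate()
import spark.implicits._

// Write a small DataFrame to Parquet; the schema is stored alongside the data.
val people = Seq(("alice", 29), ("bob", 41)).toDF("name", "age")
people.write.mode("overwrite").parquet("/tmp/people.parquet")

// Reading it back recovers the schema; note the columns come back as nullable.
spark.read.parquet("/tmp/people.parquet").printSchema()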
The default is gzip. Then create a DataFrame, say Df, from your data and save it using the command below: Df.write.parquet("path_destination"). If you check the destination folder now, you will be able to see that the files have been stored with the compression type you specified in Step 2 above.
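Putting that together, a rough sketch (assuming the elided Step 2 is setting the session-level codec on an existing sqlContext, and that the source path is a placeholder):

sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")  // Step 2: choose the codec
val Df = sqlContext.read.json("/tmp/source.json")  // hypothetical source data
Df.write.parquet("path_destination")  // the written part files reflect the chosen codec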
Parquet is built to support flexible compression options and efficient encoding schemes. Because all values in a column share the same data type, each column compresses very well (which also makes queries faster).
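To see the effect of the codec choice on output size, one illustrative comparison could look like this (assumes an existing SparkSession named spark and a local filesystem; none of this is from the original post):

import java.io.File

// Generate a small synthetic DataFrame to write with each codec.
val sample = spark.range(0L, 1000000L).selectExpr("id", "id % 100 as bucket")

for (codec <- Seq("uncompressed", "snappy", "gzip")) {
  val path = s"/tmp/parquet_$codec"
  sample.write.mode("overwrite").option("compression", codec).parquet(path)
  // Sum the sizes of the part files written to the local output directory.
  val bytes = new File(path).listFiles().filter(_.getName.endsWith(".parquet")).map(_.length).sum
  println(s"$codec: $bytes bytes")
}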
Worked for me in Spark 2.1.1:
df.write.option("compression","snappy").parquet(filename)
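As a self-contained variant (the SparkSession setup, DataFrame, and output path here are assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("snappy-write").master("local[*]").getOrCreate()
val df = spark.range(0, 1000).toDF("id")

// The "compression" option set on the writer takes precedence over the
// session-level spark.sql.parquet.compression.codec setting.
df.write.option("compression", "snappy").parquet("/tmp/snappy_out")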