I'm comparing Spark's Parquet files with Apache Drill's. Drill's Parquet files are far more lightweight than Spark's. Spark uses GZIP as its default compression codec, so to experiment I tried changing it: snappy (same size), uncompressed (same size), lzo (exception).
I tried both ways:
sqlContext.sql("SET spark.sql.parquet.compression.codec=uncompressed")
sqlContext.setConf("spark.sql.parquet.compression.codec.", "uncompressed")
But it doesn't seem to change the setting.
By default Big SQL will use SNAPPY compression when writing into Parquet tables. This means that if data is loaded into Big SQL using either the LOAD HADOOP or INSERT… SELECT commands, then SNAPPY compression is enabled by default.
Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons.
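For example, a minimal round-trip sketch (a Spark 2.x SparkSession and the /tmp paths here are assumptions, not from the original post):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-roundtrip").master("local[*]").getOrCreate()
import spark.implicits._

// Write a small DataFrame to Parquet; the schema is stored alongside the data.
val people = Seq(("alice", 29), ("bob", 41)).toDF("name", "age")
people.write.mode("overwrite").parquet("/tmp/people.parquet")

// Reading it back recovers the schema; note the columns come back as nullable.
spark.read.parquet("/tmp/people.parquet").printSchema()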
The default is gzip. Then create a DataFrame, say Df, from your data and save it using the command below: Df.write.parquet("path_destination"). If you check the destination folder now, you will be able to see that the files have been stored with the compression type you specified in Step 2 above.
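Putting that together, a rough sketch (assuming the elided Step 2 is setting the session-level codec on an existing sqlContext, and that the source path is a placeholder):

sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")  // Step 2: choose the codec
val Df = sqlContext.read.json("/tmp/source.json")  // hypothetical source data
Df.write.parquet("path_destination")  // the written part files reflect the chosen codec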
Parquet is built to support flexible compression options and efficient encoding schemes. Because all values in a column share the same data type, each column compresses very well (which also makes queries faster).
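To see the effect of the codec choice on output size, one illustrative comparison could look like this (assumes an existing SparkSession named spark and a local filesystem; none of this is from the original post):

import java.io.File

// Generate a small synthetic DataFrame to write with each codec.
val sample = spark.range(0L, 1000000L).selectExpr("id", "id % 100 as bucket")

for (codec <- Seq("uncompressed", "snappy", "gzip")) {
  val path = s"/tmp/parquet_$codec"
  sample.write.mode("overwrite").option("compression", codec).parquet(path)
  // Sum the sizes of the part files written to the local output directory.
  val bytes = new File(path).listFiles().filter(_.getName.endsWith(".parquet")).map(_.length).sum
  println(s"$codec: $bytes bytes")
}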
Worked for me in Spark 2.1.1:
df.write.option("compression","snappy").parquet(filename)
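As a self-contained variant (the SparkSession setup, DataFrame, and output path here are assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("snappy-write").master("local[*]").getOrCreate()
val df = spark.range(0, 1000).toDF("id")

// The "compression" option set on the writer takes precedence over the
// session-level spark.sql.parquet.compression.codec setting.
df.write.option("compression", "snappy").parquet("/tmp/snappy_out")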