
Writing Spark dataframe in ORC format with Snappy compression

I am successful in reading a text file stored in S3 and writing it back to S3 in ORC format using Spark DataFrames: inputDf.write().orc(outputPath);
What I am not able to do is convert to ORC format with Snappy compression. I already tried setting the codec to snappy as a write option, but Spark still writes plain ORC. How do I write in ORC format with Snappy compression to S3 using Spark DataFrames?

asked Oct 22 '25 12:10 by abstractKarshit

1 Answer

For anyone facing the same issue: as of Spark 2.0 this works by default, because the default compression codec for ORC output is snappy.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ConvertToOrc {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("OrcConvert")
                .getOrCreate();
        String inputPath = args[0];
        String outputPath = args[1];

        // Read the Ctrl-A (\001) delimited text file, then write it as ORC.
        // With Spark 2.0+, ORC output is snappy-compressed by default.
        Dataset<Row> inputDf = spark.read()
                .option("sep", "\001")
                .option("quote", "'")
                .csv(inputPath);
        inputDf.write().format("orc").save(outputPath);

        spark.stop();
    }
}
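If you want to request Snappy explicitly rather than rely on the default (for example, to guard against a cluster-level override of spark.sql.orc.compression.codec), the DataFrameWriter accepts a "compression" option for the ORC source. A minimal sketch, reusing inputDf and outputPath from the code above:

```java
// Explicitly set the ORC compression codec on the writer.
// The "compression" option overrides spark.sql.orc.compression.codec;
// accepted values include none, snappy, zlib, and lzo.
inputDf.write()
       .format("orc")
       .option("compression", "snappy")
       .save(outputPath);
```

You can verify the codec took effect by listing the output: the part files are named with a `.snappy.orc` suffix when Snappy compression is applied.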
answered Oct 25 '25 06:10 by abstractKarshit
