 

Writing RDD partitions to individual parquet files in its own directory

I am struggling with the step where I want to write each RDD partition to a separate parquet file in its own directory. An example would be:

    <root>
        <entity=entity1>
            <year=2015>
                <week=45>
                    data_file.parquet

The advantage of this format is that I can use these values directly in Spark SQL as columns, and I will not have to repeat the data in the actual files. It would also be a good way to get to a specific partition without storing separate partitioning metadata somewhere else.

As a preceding step, I have all the data loaded from a large number of gzip files and partitioned based on the key above.

A possible way would be to get each partition as a separate RDD and then write it out, though I couldn't find a good way of doing that.

Any help would be appreciated. By the way, I am new to this stack.

Rajeev Prasad asked May 20 '15

People also ask

Can parquet files be partitioned?

An ORC or Parquet file contains data columns. To these files you can add partition columns at write time. The data files do not store values for partition columns; instead, when writing the files you divide them into groups (partitions) based on column values.

Can RDD be partitioned?

Apache Spark's Resilient Distributed Datasets (RDDs) are collections of data so large that they cannot fit on a single node and must be partitioned across several nodes. Apache Spark automatically partitions RDDs and distributes the partitions across different nodes.

Can parquet data be partitioned across multiple nodes?

Yes. You can use the repartition method to adjust the number of tasks so that it is in balance with the available resources. You also need to define the number of executors per node, the number of nodes, and the memory per node when submitting the app, so that the tasks execute in parallel and utilise the maximum resources.

How many partitions should a Spark RDD have?

As already mentioned above, one partition is created for each block of the file in HDFS, which is 64 MB by default. However, when creating an RDD, a second argument can be passed that defines the (minimum) number of partitions to create. For example, passing 5 as the second argument to sc.textFile creates an RDD named textFile with at least 5 partitions.


1 Answer

I don't think the accepted answer appropriately answers the question.

Try something like this:

df.write.partitionBy("year", "month", "day").parquet("/path/to/output")

And you will get the partitioned directory structure.

BAR answered Oct 19 '22