I have some data with a timestamp column of type long containing standard epoch values (in milliseconds), and I need to save that data split into a yyyy/mm/dd/hh directory layout using Spark with Scala.
data.write.partitionBy("timestamp").format("orc").save("mypath")
This just splits the data by the raw timestamp value, like below:
timestamp=1458444061098
timestamp=1458444061198
but I want it to be like:
└── YYYY
    └── MM
        └── DD
            └── HH
You can leverage the Spark SQL date/time functions for this. First, add a new date column created from the unix timestamp column. Note that your values are epoch milliseconds, while from_unixtime expects seconds, so divide by 1000 first.
import org.apache.spark.sql.functions._

val withDateCol = data
  // lowercase "yyyy" (calendar year), not "YYYY" (week-based year); the
  // "yyyy-MM-dd HH:mm:ss" form yields a string that year()/month()/
  // dayofmonth()/hour() can parse
  .withColumn("date_col", from_unixtime(col("timestamp") / 1000, "yyyy-MM-dd HH:mm:ss"))
After this, you can add year, month, day and hour columns to the DF and then partition by these new columns for the write. Note that partitionBy is defined on DataFrameWriter, so the chain needs a .write before it:
withDateCol
  .withColumn("year", year(col("date_col")))
  .withColumn("month", month(col("date_col")))
  .withColumn("day", dayofmonth(col("date_col")))
  .withColumn("hour", hour(col("date_col")))
  .drop("date_col")
  .write
  .partitionBy("year", "month", "day", "hour")
  .format("orc")
  .save("mypath")
The columns included in the partitionBy clause won't be part of the file schema; they are encoded in the directory names instead.
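As a quick usage sketch (assuming an active SparkSession named spark), Spark rediscovers the partition columns from the directory names when reading back, so filters on them prune whole directories instead of scanning every file:

val readBack = spark.read.orc("mypath")
readBack.filter(col("year") === 2016 && col("hour") === 3).show()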