Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Collapse a Spark DataFrame

I am using Spark with Scala. Spark version 1.5 and I am trying to transform input dataframe which has name value combination to a new dataframe in which all name to be transposed to columns and values as rows.

I/P DataFrame:

ID  Name    Value  
1   Country US  
2   Country US  
2   State   NY  
3   Country UK  
4   Country India  
4   State   MH  
5   Country US  
5   State   NJ  
5   County  Hudson  

Link here for image

Transposed DataFrame

ID  Country State   County  
1   US      NULL    NULL  
2   US      NY      NULL  
3   UK      NULL    NULL  
4   India   MH      NULL  
5   US      NJ      Hudson  

Link to transposed image
Seems like pivot would help in this use case, but its not supported in spark 1.5.x version.

Any pointers/help?

like image 614
Learner Avatar asked Sep 26 '22 05:09

Learner


1 Answers

This is a really ugly data but you can always filter and join:

val names = Seq("Country", "State", "County")

names.map(name => 
  df.where($"Name" === name).select($"ID", $"Value".alias("name"))
).reduce((df1, df2) => df1.join(df2, Seq("ID"), "leftouter"))

map creates a list of three DataFrames where each table contains records containing only a single name. Next we simply reduce this list using left outer join. So putting it all together you get something like this:

(left-outer-join
  (left-outer-join 
    (where df (=== name "Country"))
    (where df (=== name "State")))
  (where df (=== name "County")))

Note: If you use Spark >= 1.6 with Python or Scala, or Spark >= 2.0 with R, just use pivot with first:

  • Reshaping/Pivoting data in Spark RDD and/or Spark DataFrames
  • How to pivot DataFrame?
like image 178
zero323 Avatar answered Sep 29 '22 08:09

zero323