I am using Spark with Scala. Spark version 1.5 and I am trying to transform input dataframe which has name value combination to a new dataframe in which all name to be transposed to columns and values as rows.
I/P DataFrame:
ID  Name    Value  
1   Country US  
2   Country US  
2   State   NY  
3   Country UK  
4   Country India  
4   State   MH  
5   Country US  
5   State   NJ  
5   County  Hudson  
Link here for image
Transposed DataFrame
ID  Country State   County  
1   US      NULL    NULL  
2   US      NY      NULL  
3   UK      NULL    NULL  
4   India   MH      NULL  
5   US      NJ      Hudson  
Link to transposed image
Seems like pivot would help in this use case, but its not supported in spark 1.5.x version.
Any pointers/help?
This is a really ugly data but you can always filter and join:
val names = Seq("Country", "State", "County")
names.map(name => 
  df.where($"Name" === name).select($"ID", $"Value".alias("name"))
).reduce((df1, df2) => df1.join(df2, Seq("ID"), "leftouter"))
map creates a list of three DataFrames where each table contains records containing only a single name. Next we simply reduce this list using left outer join. So putting it all together you get something like this:
(left-outer-join
  (left-outer-join 
    (where df (=== name "Country"))
    (where df (=== name "State")))
  (where df (=== name "County")))
Note: If you use Spark >= 1.6 with Python or Scala, or Spark >= 2.0 with R, just use pivot with first:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With