I am using Spark 1.5 with Scala. I am trying to transform an input DataFrame of name/value pairs into a new DataFrame in which each distinct name becomes a column and the corresponding values fill the rows.
Input DataFrame:
ID  Name     Value
1   Country  US
2   Country  US
2   State    NY
3   Country  UK
4   Country  India
4   State    MH
5   Country  US
5   State    NJ
5   County   Hudson
Transposed DataFrame
ID  Country  State  County
1   US       NULL   NULL
2   US       NY     NULL
3   UK       NULL   NULL
4   India    MH     NULL
5   US       NJ     Hudson
It seems like pivot would help in this use case, but it is not supported in Spark 1.5.x.
Any pointers/help?
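This is really ugly data, but you can always filter and join:
    val names = Seq("Country", "State", "County")

    // Build one (ID, <name>) DataFrame per name, then left-outer-join them on ID.
    // The alias must be the variable `name`, not the literal "name", so that each
    // Value column ends up named Country, State, or County.
    names.map(name =>
      df.where($"Name" === name).select($"ID", $"Value".alias(name))
    ).reduce((df1, df2) => df1.join(df2, Seq("ID"), "leftouter"))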
map creates a list of three DataFrames, each containing only the records for a single name. Next we simply reduce this list using a left outer join. Putting it all together, you get something like this:
(left-outer-join
(left-outer-join
(where df (=== name "Country"))
(where df (=== name "State")))
(where df (=== name "County")))
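One caveat with the chain above: the first table (Country) is the left side of every join, so an ID with no Country row would be dropped. In the sample data every ID has a Country, so this does not matter there. A minimal sketch of a variant that starts from the distinct IDs instead is shown below. The names transposeByJoin and ids are made up for illustration, and the sketch uses the same join(df, Seq("ID"), "leftouter") overload as the snippet above, so check that your Spark version provides it.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: assumes `df` has the columns ID, Name and Value.
def transposeByJoin(df: DataFrame, names: Seq[String]): DataFrame = {
  // Start from every distinct ID so that no ID is lost,
  // even if it has no row for the first name in `names`.
  val ids = df.select("ID").distinct()

  names.foldLeft(ids) { (acc, name) =>
    // One (ID, <name>) table per name, with Value renamed to that name.
    val column = df
      .where(col("Name") === name)
      .select(col("ID"), col("Value").alias(name))
    acc.join(column, Seq("ID"), "leftouter")
  }
}
```

Calling transposeByJoin(df, Seq("Country", "State", "County")) should give one row per ID, with NULL wherever an ID has no value for a given name. This assumes each (ID, Name) pair occurs at most once; duplicates would multiply rows in the join.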
Note: If you use Spark >= 1.6 with Python or Scala, or Spark >= 2.0 with R, just use pivot with first.
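The snippet that belonged to this note is not preserved here, so the following is only a sketch of what pivot with first could look like in Scala on Spark >= 1.6. The function name transposeByPivot is made up for illustration, and df is assumed to have the columns ID, Name and Value from the question.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.first

// Hypothetical helper: assumes `df` has the columns ID, Name and Value.
def transposeByPivot(df: DataFrame): DataFrame =
  df.groupBy("ID")
    // Listing the pivot values up front fixes the column order and
    // saves Spark a pass over the data to discover the distinct names.
    .pivot("Name", Seq("Country", "State", "County"))
    // Take the first Value seen for each (ID, Name) pair.
    .agg(first("Value"))
```

The explicit list of values is optional; pivot("Name") alone lets Spark work out the distinct names itself, at the cost of an extra job.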