
Unpivot in spark-sql/pyspark


I have a problem statement at hand wherein I want to unpivot a table in spark-sql/pyspark. I have gone through the documentation and I can see there is support only for pivot, but no support for unpivot so far.

Let my initial table look like this:

+---+---+---+
|  A|  B|  C|
+---+---+---+
|  G|  X|  4|
|  G|  Y|  2|
|  H|  Y|  4|
|  H|  Z|  5|
+---+---+---+

When I pivot this in pyspark using the command below:

df.groupBy("A").pivot("B").sum("C") 

I get this as the output:

+---+----+---+----+
|  A|   X|  Y|   Z|
+---+----+---+----+
|  G|   4|  2|null|
|  H|null|  4|   5|
+---+----+---+----+
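Here is a minimal runnable version of the above, with the sample data written out to match the two tables:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Initial table with columns A, B, C (same data as the first table above)
df = spark.createDataFrame(
    [("G", "X", 4), ("G", "Y", 2), ("H", "Y", 4), ("H", "Z", 5)],
    ["A", "B", "C"])

# Distinct values of B become columns; cells are filled with sum(C)
df.groupBy("A").pivot("B").sum("C").show()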

Now I want to unpivot the pivoted table. In general, this operation may or may not recover the original table, depending on how the original was pivoted: if, say, two rows had shared the same (A, B) pair, sum("C") would have collapsed them into one cell, and the original rows could not be recovered.

Spark-sql as of now doesn't provide out-of-the-box support for unpivot. Is there a way I can achieve this?

asked Feb 26 '17 by Manish Mehra

People also ask

How do you convert columns to rows in Spark SQL?

Spark SQL provides a pivot() function to rotate data from rows into columns. It is an aggregation in which the distinct values of one column are transposed into individual columns. The reverse, converting columns back to rows, is done with the stack function, as in the answer below and the sketch that follows.
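For instance, a minimal sketch of the columns-to-rows direction in plain Spark SQL, run through pyspark. It assumes the pivoted DataFrame from the question is held in a variable named pivoted; that name and the view name are chosen here only for illustration:

# Register the pivoted DataFrame as a temp view (name chosen for illustration)
pivoted.createOrReplaceTempView("pivoted")

# stack() turns the X, Y, Z columns back into (B, C) rows; drop the nulls
spark.sql("""
    SELECT A, B, C
    FROM (
      SELECT A, stack(3, 'X', X, 'Y', Y, 'Z', Z) AS (B, C)
      FROM pivoted
    ) AS t
    WHERE C IS NOT NULL
""").show()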

What is pivot in PySpark?

In PySpark, pivot() rotates the values of one DataFrame column into multiple columns. It is an aggregation applied after groupBy(), combined with an aggregating function such as sum(). Conventionally, the cheaper approach is to supply the pivot values up front rather than let Spark compute them.
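A minimal sketch of that cheaper path, with the column names from the question above; passing the values list is what skips the extra pass to collect the distinct values of B:

# Explicit pivot values avoid a separate distinct-value scan of column B
df.groupBy("A").pivot("B", ["X", "Y", "Z"]).sum("C")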


1 Answer

You can use the built-in stack function, for example in Scala:

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val df = Seq(("G",Some(4),2,None),("H",None,4,Some(5))).toDF("A","X","Y","Z")
df: org.apache.spark.sql.DataFrame = [A: string, X: int ... 2 more fields]

scala> df.show
+---+----+---+----+
|  A|   X|  Y|   Z|
+---+----+---+----+
|  G|   4|  2|null|
|  H|null|  4|   5|
+---+----+---+----+

scala> df.select($"A", expr("stack(3, 'X', X, 'Y', Y, 'Z', Z) as (B, C)")).where("C is not null").show
+---+---+---+
|  A|  B|  C|
+---+---+---+
|  G|  X|  4|
|  G|  Y|  2|
|  H|  Y|  4|
|  H|  Z|  5|
+---+---+---+

Or in pyspark:

In [1]: df = spark.createDataFrame([("G",4,2,None),("H",None,4,5)], list("AXYZ"))

In [2]: df.show()
+---+----+---+----+
|  A|   X|  Y|   Z|
+---+----+---+----+
|  G|   4|  2|null|
|  H|null|  4|   5|
+---+----+---+----+

In [3]: df.selectExpr("A", "stack(3, 'X', X, 'Y', Y, 'Z', Z) as (B, C)").where("C is not null").show()
+---+---+---+
|  A|  B|  C|
+---+---+---+
|  G|  X|  4|
|  G|  Y|  2|
|  H|  Y|  4|
|  H|  Z|  5|
+---+---+---+
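If you have many value columns, the stack expression can be generated rather than written by hand. A sketch of a hypothetical helper (the name unpivot and its parameters are mine, not part of the API; note that all value columns must share a compatible type):

def unpivot(df, id_cols, value_cols, var_name="B", value_name="C"):
    # Build "stack(n, 'c1', c1, 'c2', c2, ...) as (var_name, value_name)"
    pairs = ", ".join("'{0}', `{0}`".format(c) for c in value_cols)
    stacked = "stack({0}, {1}) as ({2}, {3})".format(
        len(value_cols), pairs, var_name, value_name)
    return (df.selectExpr(*(id_cols + [stacked]))
              .where("{0} is not null".format(value_name)))

unpivot(df, ["A"], ["X", "Y", "Z"]).show()   # same output as In [3] above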
answered Sep 17 '22 by Andrew Ray