 

How to melt Spark DataFrame?

Is there an equivalent of the Pandas melt function in Apache Spark, in PySpark or at least in Scala?

Until now I was running on a sample dataset in Python, and now I want to use Spark for the entire dataset.

asked Jan 16 '17 by Venkatesh Durgumahanthi


People also ask

What is melt in PySpark?

Melt (also known as unpivot) converts a data frame from wide to long format: the names of the melted columns become values in a variable column, paired with the corresponding cell values. There are likely several ways to implement a melt function in PySpark.

How do I Unpivot Spark data frame?

Unpivot is the reverse of pivot: column values are rotated into row values. There is no dedicated DataFrame operator for the unpivot operation; one approach is to use selectExpr() along with the stack built-in function, as sketched below.
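For illustration, here is a minimal sketch of that selectExpr()/stack approach. This is not from the original answers; the column names A, B, and C are assumptions chosen to match the example used in the answers below, and all melted columns must share a common type for stack to work:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed wide-format input: one id column A and two value columns B, C
df = spark.createDataFrame(
    [("a", 1, 2), ("b", 3, 4), ("c", 5, 6)], ["A", "B", "C"])

# stack(2, ...) emits two rows per input row, one per melted column
df.selectExpr(
    "A", "stack(2, 'B', B, 'C', C) as (variable, value)").show()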

What is melt ()?

The Pandas melt() function is used to change a DataFrame from wide to long format. It produces a layout in which one or more columns act as identifiers; all remaining columns are treated as values and unpivoted to the row axis, leaving only two non-identifier columns, variable and value.

Can you convert a Spark DataFrame to a pandas DataFrame?

PySpark DataFrame provides a toPandas() method to convert it to a Python Pandas DataFrame. toPandas() collects all records of the PySpark DataFrame to the driver program, so it should only be done on a small subset of the data.
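As a quick hedged illustration (sdf stands for any PySpark DataFrame; the limit(1000) is just an illustrative safeguard, not part of the API requirement):

# toPandas() collects every remaining row to the driver,
# so shrink the data first if the frame could be large.
pdf = sdf.limit(1000).toPandas()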


2 Answers

There is no built-in function (if you work with SQL and Hive support enabled you can use the stack function, but it is not exposed in the DataFrame API and has no native implementation), but it is trivial to roll your own. Required imports:

from pyspark.sql.functions import array, col, explode, lit, struct
from pyspark.sql import DataFrame
from typing import Iterable

Example implementation:

def melt(
        df: DataFrame,
        id_vars: Iterable[str], value_vars: Iterable[str],
        var_name: str = "variable", value_name: str = "value") -> DataFrame:
    """Convert :class:`DataFrame` from wide to long format."""

    # Create array<struct<variable: str, value: ...>>
    _vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars))

    # Add to the DataFrame and explode
    _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))

    cols = id_vars + [
        col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
    return _tmp.select(*cols)

And some tests (based on Pandas doctests):

import pandas as pd

pdf = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                    'B': {0: 1, 1: 3, 2: 5},
                    'C': {0: 2, 1: 4, 2: 6}})

pd.melt(pdf, id_vars=['A'], value_vars=['B', 'C'])
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
3  a        C      2
4  b        C      4
5  c        C      6
sdf = spark.createDataFrame(pdf)
melt(sdf, id_vars=['A'], value_vars=['B', 'C']).show()
+---+--------+-----+
|  A|variable|value|
+---+--------+-----+
|  a|       B|    1|
|  a|       C|    2|
|  b|       B|    3|
|  b|       C|    4|
|  c|       B|    5|
|  c|       C|    6|
+---+--------+-----+

Note: to use this with legacy Python versions, remove the type annotations.

Related:

  • r sparkR - equivalent to melt function
  • Gather in sparklyr
answered by zero323


I came across this question while searching for an implementation of melt in Spark for Scala.

Posting my Scala port in case someone else stumbles upon this.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame

/** Extends the [[org.apache.spark.sql.DataFrame]] class
 *
 *  @param df the data frame to melt
 */
implicit class DataFrameFunctions(df: DataFrame) {

    /** Convert [[org.apache.spark.sql.DataFrame]] from wide to long format.
     *
     *  melt is (kind of) the inverse of pivot
     *  melt is currently (02/2017) not implemented in Spark
     *
     *  @see reshape package in R (https://cran.r-project.org/web/packages/reshape/index.html)
     *  @see this is a Scala adaptation of http://stackoverflow.com/questions/41670103/pandas-melt-function-in-apache-spark
     *
     *  @todo method overloading for simple calling
     *
     *  @param id_vars the columns to preserve
     *  @param value_vars the columns to melt
     *  @param var_name the name for the column holding the melted column names
     *  @param value_name the name for the column holding the values of the melted columns
     */
    def melt(
            id_vars: Seq[String], value_vars: Seq[String],
            var_name: String = "variable", value_name: String = "value"): DataFrame = {

        // Create array<struct<variable: str, value: ...>>
        val _vars_and_vals = array((for (c <- value_vars) yield {
            struct(lit(c).alias(var_name), col(c).alias(value_name))
        }): _*)

        // Add to the DataFrame and explode
        val _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))

        val cols = id_vars.map(col _) ++ {
            for (x <- List(var_name, value_name)) yield {
                col("_vars_and_vals")(x).alias(x)
            }
        }

        _tmp.select(cols: _*)
    }
}

Since I'm not that advanced in Scala, I'm sure there is room for improvement.

Any comments are welcome.

answered by Ahue