Pyspark > Dataframe with multiple array columns into multiple rows with one value each

Tags:

python

dataframe

apache-spark

apache-spark-sql

pyspark

We have a PySpark DataFrame with several columns containing arrays with multiple values. Our goal is to put each value of these array columns on its own row, keeping the columns separate. So, starting with something like this:

data = [
    ("A", ["a", "c"], ["1", "5"]),
    ("B", ["a", "b"], None),
    ("C", [], ["1"]),
]

Which looks like this:

+---+------+------+
|id |list_a|list_b|
+---+------+------+
|A  |[a, c]|[1, 5]|
|B  |[a, b]|null  |
|C  |[]    |[1]   |
+---+------+------+

We would like to end up having:

+---+----+----+
|id |col |col |
+---+----+----+
|A  |a   |null|
|A  |c   |null|
|A  |null|1   |
|A  |null|5   |
|B  |a   |null|
|B  |b   |null|
|C  |null|1   |
+---+----+----+
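
To pin down the target semantics independently of Spark, here is a minimal plain-Python model of the transformation. It is not PySpark code, and the function name `explode_each_column` is made up for illustration. It treats `None` and `[]` alike, as the desired output above does.

```python
def explode_each_column(rows):
    """Emit one output row per array element, with every other array column set to None.

    Each input row is (id, array_1, array_2, ...); None and [] both yield no rows.
    """
    out = []
    for row_id, *arrays in rows:
        for pos, values in enumerate(arrays):
            for value in values or []:
                cells = [None] * len(arrays)
                cells[pos] = value
                out.append((row_id, *cells))
    return out


data = [
    ("A", ["a", "c"], ["1", "5"]),
    ("B", ["a", "b"], None),
    ("C", [], ["1"]),
]

for row in explode_each_column(data):
    print(row)
```

The rows come out grouped by id and then by column, which matches the ordering of the table above. Spark itself gives no ordering guarantee unless you sort explicitly.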

We are thinking about several approaches:

1. prefixing each value with a column indicator, merging all the arrays into a single one, exploding it and reorganizing the different values into different columns
2. splitting the dataframe into several, each one with one of these array columns, exploding the array column and then concatenating the dataframes

But both of them feel like dirty, complex, error-prone and inefficient workarounds.

Does anyone have an idea about how to solve this in an elegant manner?

asked Sep 07 '21 09:09

landoooo

3 Answers

Since both list_a and list_b could be null in the same row, I would add a fourth case to the dataset:

data = [
    ("A", ["a", "c"], ["1", "5"]),
    ("B", ["a", "b"], None),
    ("C", [], ["1"]),
    ("D", None, None),
]
df = spark.createDataFrame(data,["id","list_a","list_b"])

I would then split the original df into three parts (both columns null, list_a exploded and list_b exploded) and combine them with unionByName:

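from pyspark.sql.functions import col, explode_outer, lit

# Rows where both arrays are null: keep the id, with null in both columns
dfnulls = df.filter(col("list_a").isNull() & col("list_b").isNull())\
    .withColumn("list_a", lit(None))\
    .withColumn("list_b", lit(None))

# One row per element of list_a, with list_b nulled out
df1 = df\
    .withColumn("list_a", explode_outer(col("list_a")))\
    .withColumn("list_b", lit(None))\
    .filter(~col("list_a").isNull())

# One row per element of list_b, with list_a nulled out
df2 = df\
    .withColumn("list_b", explode_outer(col("list_b")))\
    .withColumn("list_a", lit(None))\
    .filter(~col("list_b").isNull())

# Align the columns by name and stack the three pieces
merged_df = df1.unionByName(df2).unionByName(dfnulls)

merged_df.show()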

+---+------+------+
| id|list_a|list_b|
+---+------+------+
|  A|     a|  null|
|  A|     c|  null|
|  B|     a|  null|
|  B|     b|  null|
|  A|  null|     1|
|  A|  null|     5|
|  C|  null|     1|
|  D|  null|  null|
+---+------+------+
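
A note on `explode_outer` followed by the not-null filter in `df1` and `df2`: as far as I know, `explode` drops rows whose array is null or empty, while `explode_outer` keeps them with a null element. Filtering the nulls out again should therefore give the same rows as plain `explode`, assuming the arrays themselves contain no null elements. Here is a plain-Python sketch of that behaviour. The helpers below only model the two functions on tuples and are not the Spark API.

```python
def explode(rows, pos):
    """Model of explode: one row per element; null or empty arrays produce no rows."""
    return [
        row[:pos] + (value,) + row[pos + 1:]
        for row in rows
        for value in (row[pos] or [])
    ]


def explode_outer(rows, pos):
    """Model of explode_outer: null or empty arrays produce a single row holding None."""
    return [
        row[:pos] + (value,) + row[pos + 1:]
        for row in rows
        for value in (row[pos] or [None])
    ]


data = [
    ("A", ["a", "c"], ["1", "5"]),
    ("B", ["a", "b"], None),
    ("C", [], ["1"]),
    ("D", None, None),
]

outer = explode_outer(data, 1)                            # keeps C and D with None in list_a
filtered = [row for row in outer if row[1] is not None]   # the not-null filter
print(filtered == explode(data, 1))
```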

answered Oct 16 '22 20:10

ferran

The following approach, written in Scala, might help.

It explodes each list column separately and then full-outer-joins the two datasets on a dummy column. Because the dummy values differ between the two sides, no rows ever match, so every exploded row is kept with null in the other list column.

import org.apache.spark.sql.functions.{explode_outer, col, lit, concat}


val df1 = inputDF
  .withColumn("list_a", explode_outer(col("list_a")))
  .withColumn("random_join_col", concat(col("id"), lit("1")))
  .drop("list_b")

val df2 = inputDF
  .withColumn("list_b", explode_outer(col("list_b")))
  .withColumn("random_join_col", concat(col("id"), lit("2")))
  .drop("list_a")


val finalDF = df1.join(df2, Seq("id", "random_join_col"), "full_outer").drop("random_join_col")

// Drop rows, if it got null value on both the list columns
finalDF.na.drop(how = "all", Seq("list_a","list_b")).orderBy("id").show(false)
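
Since the join key never matches, the full outer join above behaves like a union that pads the missing column with nulls. Here is a small plain-Python model of that idea. The `full_outer_join` helper is written for this illustration and is not a Spark API. Rows are dicts, and `key` plays the role of `random_join_col`.

```python
def full_outer_join(left, right, keys):
    """Naive full outer join of two lists of dicts on the given key columns."""
    left_cols = [c for c in (left[0] if left else {}) if c not in keys]
    right_cols = [c for c in (right[0] if right else {}) if c not in keys]
    out, matched = [], set()
    for lrow in left:
        hits = [i for i, rrow in enumerate(right)
                if all(lrow[k] == rrow[k] for k in keys)]
        matched.update(hits)
        if hits:
            out.extend({**lrow, **right[i]} for i in hits)
        else:
            # no partner on the right: pad the right-hand columns with None
            out.append({**lrow, **dict.fromkeys(right_cols)})
    for i, rrow in enumerate(right):
        if i not in matched:
            # no partner on the left: pad the left-hand columns with None
            out.append({**dict.fromkeys(left_cols), **rrow})
    return out


data = [
    ("A", ["a", "c"], ["1", "5"]),
    ("B", ["a", "b"], None),
    ("C", [], ["1"]),
]

# explode_outer equivalent: a null or empty array yields a single None
df1 = [{"id": i, "key": i + "1", "list_a": v} for i, a, _ in data for v in (a or [None])]
df2 = [{"id": i, "key": i + "2", "list_b": v} for i, _, b in data for v in (b or [None])]

joined = full_outer_join(df1, df2, ["id", "key"])

# na.drop(how = "all", ...) equivalent, followed by orderBy("id")
final = [
    (r["id"], r["list_a"], r["list_b"])
    for r in joined
    if not (r["list_a"] is None and r["list_b"] is None)
]
final.sort(key=lambda r: r[0])
print(final)
```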

answered Oct 16 '22 21:10

Sivakumar

Try this dynamic solution.

Input:

data = [
    ("A", ["a", "c"], ["1", "5"]),
    ("B", ["a", "b"], None),
    ("C", [], ["1"]),
]
df=spark.createDataFrame(data,["id","list_a","list_b"])
df.show(truncate=False)
+---+------+------+
|id |list_a|list_b|
+---+------+------+
|A  |[a, c]|[1, 5]|
|B  |[a, b]|null  |
|C  |[]    |[1]   |
+---+------+------+

Let's create a list of DataFrames, one for each of the array columns in df. Initialize it with empty DataFrames first and then overwrite each entry in the for loop. For each column, explode it, and replace all the other array columns with a NULL cast to string.

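from pyspark.sql.functions import col, explode, expr
from pyspark.sql.types import StructType

array_cols=df.columns[1:]  #just ignoring the ID column
c=0
# placeholder DataFrames, one per array column; each is overwritten in the loop
dfarr=[spark.createDataFrame([],schema=StructType()) for i in array_cols ]
for i in array_cols:
    # explode column i (rows with a null or empty array are dropped)
    dfarr[c]=df.withColumn(i,explode(col(i)))
    for j in array_cols:
        if(i!=j):
            # replace every other array column with a string-typed NULL
            dfarr[c]=dfarr[c].withColumn(j,expr(" cast(null as string) "))
    c=c+1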

Now dfarr is a list of DataFrames that all share a schema like this:

dfarr[0].printSchema()
root
 |-- id: string (nullable = true)
 |-- list_a: string (nullable = true)
 |-- list_b: string (nullable = true)

dfarr[1].show(truncate=False)
+---+------+------+
|id |list_a|list_b|
+---+------+------+
|A  |null  |1     |
|A  |null  |5     |
|C  |null  |1     |
+---+------+------+

The datatypes in dfarr are now the same across all the DataFrames, so we can simply take the union of all of them. For this we need the reduce function from functools:

from functools import reduce  
from pyspark.sql import DataFrame

def unionAll(*dfs):
    return reduce(DataFrame.unionByName, dfs)
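
`reduce` folds the list pairwise: `reduce(f, [d0, d1, d2])` evaluates `f(f(d0, d1), d2)`. Here is a plain-Python analogy, with a toy `union_by_name` standing in for `DataFrame.unionByName`. The stand-in and its `(columns, rows)` representation are assumptions made for illustration only.

```python
from functools import reduce


def union_by_name(left, right):
    """Toy stand-in for unionByName: append right's rows, reordered to left's column order."""
    left_cols, left_rows = left
    right_cols, right_rows = right
    if sorted(left_cols) != sorted(right_cols):
        raise ValueError("column names differ")
    order = [right_cols.index(c) for c in left_cols]
    return left_cols, left_rows + [tuple(row[i] for i in order) for row in right_rows]


def union_all(*frames):
    # reduce(f, [d0, d1, d2]) == f(f(d0, d1), d2)
    return reduce(union_by_name, frames)


# Same columns in a different order: matching is done by name, not by position
frame_a = (["id", "list_a", "list_b"], [("A", "a", None), ("A", "c", None)])
frame_b = (["list_b", "id", "list_a"], [("1", "A", None), ("5", "A", None)])

cols, rows = union_all(frame_a, frame_b)
print(cols)
print(rows)
```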

Applying it to our dfarr:

combo=unionAll(*dfarr)

combo.show(truncate=False)
+---+------+------+
|id |list_a|list_b|
+---+------+------+
|A  |a     |null  |
|A  |c     |null  |
|B  |a     |null  |
|B  |b     |null  |
|A  |null  |1     |
|A  |null  |5     |
|C  |null  |1     |
+---+------+------+

answered Oct 16 '22 21:10

stack0114106
