
How to explode multiple columns of a dataframe in pyspark

I have a dataframe whose columns contain lists, similar to the following. The lists are not all the same length across columns.

Name  Age  Subjects                  Grades
[Bob] [16] [Maths,Physics,Chemistry] [A,B,C]

I want to explode the dataframe in such a way that I get the following output:

Name Age Subjects  Grades
Bob  16  Maths     A
Bob  16  Physics   B
Bob  16  Chemistry C

How can I achieve this?

asked Jun 28 '18 by Visualisation App


2 Answers

PySpark has added an arrays_zip function in 2.4, which eliminates the need for a Python UDF to zip the arrays.

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(['Bob'], [16], ['Maths', 'Physics', 'Chemistry'], ['A', 'B', 'C'])],
    ['Name', 'Age', 'Subjects', 'Grades'])

# Zip the two arrays element-wise into an array of structs, explode that array,
# then pull the struct fields back out as top-level columns.
df = df.withColumn("new", F.arrays_zip("Subjects", "Grades"))\
       .withColumn("new", F.explode("new"))\
       .select("Name", "Age",
               F.col("new.Subjects").alias("Subjects"),
               F.col("new.Grades").alias("Grades"))
df.show()

+-----+----+---------+------+
| Name| Age| Subjects|Grades|
+-----+----+---------+------+
|[Bob]|[16]|    Maths|     A|
|[Bob]|[16]|  Physics|     B|
|[Bob]|[16]|Chemistry|     C|
+-----+----+---------+------+
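
If you also want Name and Age as plain scalars rather than single-element arrays (to match the desired output exactly), something along these lines should work on the DataFrame above, using Column.getItem:

# Flatten the single-element Name and Age arrays to scalars, so the rows read
# Bob / 16 instead of [Bob] / [16].
df = df.withColumn("Name", F.col("Name").getItem(0))\
       .withColumn("Age", F.col("Age").getItem(0))
df.show()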
answered Oct 24 '22 by abeboparebop


This works,

import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(['Bob'], [16], ['Maths', 'Physics', 'Chemistry'], ['A', 'B', 'C'])],
    ['Name', 'Age', 'Subjects', 'Grades'])
df.show()

+-----+----+--------------------+---------+
| Name| Age|            Subjects|   Grades|
+-----+----+--------------------+---------+
|[Bob]|[16]|[Maths, Physics, ...|[A, B, C]|
+-----+----+--------------------+---------+

Use a UDF with zip: the columns that need to be exploded together have to be merged into a single column of structs before exploding.

# Zip Subjects and Grades into an array of (subs, grades) structs so that the
# two columns can be exploded together.
combine = F.udf(lambda x, y: list(zip(x, y)),
                ArrayType(StructType([StructField("subs", StringType()),
                                      StructField("grades", StringType())])))

df = df.withColumn("new", combine("Subjects", "Grades"))\
       .withColumn("new", F.explode("new"))\
       .select("Name", "Age",
               F.col("new.subs").alias("Subjects"),
               F.col("new.grades").alias("Grades"))
df.show()


+-----+----+---------+------+
| Name| Age| Subjects|Grades|
+-----+----+---------+------+
|[Bob]|[16]|    Maths|     A|
|[Bob]|[16]|  Physics|     B|
|[Bob]|[16]|Chemistry|     C|
+-----+----+---------+------+
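
One caveat: Python's zip stops at the shorter of the two lists, so any extra entries in the longer one are silently dropped. If Subjects and Grades can have different lengths within a row, a zip_longest variant (a sketch, assuming you want the missing side filled with nulls) keeps those entries:

from itertools import zip_longest

# Same idea as `combine` above, but pad the shorter list with None (-> null)
# instead of truncating to the shorter length.
combine_longest = F.udf(lambda x, y: list(zip_longest(x, y)),
                        ArrayType(StructType([StructField("subs", StringType()),
                                              StructField("grades", StringType())])))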
answered Oct 24 '22 by mayank agrawal