 

How to zip two array columns in Spark SQL

I have a Pandas dataframe. I first split the two string columns into lists and then, using zip, joined each pair of elements with '_'. My data set looks like this:

df['column_1']: 'abc, def, ghi'
df['column_2']: '1.0, 2.0, 3.0'

I wanted to join these two columns in a third column like below for each row of my dataframe.

df['column_3']: [abc_1.0, def_2.0, ghi_3.0]

I have successfully done this in Python using the code below, but the dataframe is quite large and running it over the whole dataframe takes a very long time. I want to do the same thing in PySpark for efficiency. I have read the data into a Spark dataframe successfully, but I'm having a hard time figuring out which PySpark functions replicate the Pandas ones. How can I get my desired result in PySpark?

df['column_3'] = df['column_2']
for index, row in df.iterrows():
    if isinstance(row['column_1'], str):
        # split both strings into lists and join the paired elements with '_'
        col1 = row['column_1'].split(',')
        col2 = row['column_2'].split(',')
        df.at[index, 'column_3'] = ['_'.join(map(str, pair)) for pair in zip(col1, col2)]

I have converted the two columns to arrays in PySpark using the code below:

from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import col, split

# withColumn returns a new DataFrame, so assign the result back
crash = crash.withColumn(
    "column_1", split(col("column_1"), r",\s*").cast(ArrayType(StringType()))
)
crash = crash.withColumn(
    "column_2", split(col("column_2"), r",\s*").cast(ArrayType(StringType()))
)

Now all I need is to zip each element of the arrays in the two columns using '_'. How can I use zip with this? Any help is appreciated.

Falconic asked Jan 21 '19

2 Answers

A Spark SQL equivalent of Python's zip would be pyspark.sql.functions.arrays_zip:

pyspark.sql.functions.arrays_zip(*cols)

Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.

So if you already have two arrays:

from pyspark.sql.functions import split

df = (spark
    .createDataFrame([('abc, def, ghi', '1.0, 2.0, 3.0')])
    .toDF("column_1", "column_2")
    .withColumn("column_1", split("column_1", r"\s*,\s*"))
    .withColumn("column_2", split("column_2", r"\s*,\s*")))

You can just apply it to the result:

from pyspark.sql.functions import arrays_zip

df_zipped = df.withColumn(
  "zipped", arrays_zip("column_1", "column_2")
)

df_zipped.select("zipped").show(truncate=False)
+------------------------------------+
|zipped                              |
+------------------------------------+
|[[abc, 1.0], [def, 2.0], [ghi, 3.0]]|
+------------------------------------+

Now, to combine the results, you can use transform (see How to use transform higher-order function? and TypeError: Column is not iterable - How to iterate over ArrayType()?):

from pyspark.sql.functions import expr

df_zipped_concat = df_zipped.withColumn(
    "zipped_concat",
    expr("transform(zipped, x -> concat_ws('_', x.column_1, x.column_2))")
)

df_zipped_concat.select("zipped_concat").show(truncate=False)
+---------------------------+
|zipped_concat              |
+---------------------------+
|[abc_1.0, def_2.0, ghi_3.0]|
+---------------------------+

Note:

The higher-order function transform and arrays_zip were both introduced in Apache Spark 2.4.
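
For Spark 3.1 and later, pyspark.sql.functions.zip_with can merge the two array columns element-wise in a single step, without the intermediate struct column. A minimal sketch (the df_zip_with name is just illustrative), assuming the same df with column_1 and column_2 already split into arrays:

from pyspark.sql.functions import zip_with, concat_ws

# combine the arrays pairwise, joining each pair of elements with '_'
df_zip_with = df.withColumn(
    "column_3",
    zip_with("column_1", "column_2", lambda x, y: concat_ws("_", x, y))
)

df_zip_with.select("column_3").show(truncate=False)
+---------------------------+
|column_3                   |
+---------------------------+
|[abc_1.0, def_2.0, ghi_3.0]|
+---------------------------+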

10465355 answered Sep 21 '22


You can also use a UDF to zip the split array columns:

df = spark.createDataFrame([('abc,def,ghi', '1.0,2.0,3.0')], ['col1', 'col2'])
df.show(truncate=False)
+-----------+-----------+
|col1       |col2       |
+-----------+-----------+
|abc,def,ghi|1.0,2.0,3.0|
+-----------+-----------+ ## Hope this is how your dataframe is

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

def concat_udf(*args):
    # zip the split arrays element-wise and join each pair with '_'
    return ['_'.join(x) for x in zip(*args)]

udf1 = F.udf(concat_udf, ArrayType(StringType()))
df = df.withColumn('col3', udf1(F.split(df.col1, ','), F.split(df.col2, ',')))
df.show(1, False)
+-----------+-----------+---------------------------+
|col1       |col2       |col3                       |
+-----------+-----------+---------------------------+
|abc,def,ghi|1.0,2.0,3.0|[abc_1.0, def_2.0, ghi_3.0]|
+-----------+-----------+---------------------------+
Suresh answered Sep 18 '22
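
Since the original motivation was speed, note that a plain Python UDF serializes every row to a Python worker, so on large data the built-in arrays_zip/transform (or zip_with) route is usually faster. If you prefer a UDF anyway, a vectorized pandas UDF can cut that overhead. A minimal sketch, assuming Spark 3.0+ for the type-hint style (the zip_join name is just illustrative):

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

@F.pandas_udf(ArrayType(StringType()))
def zip_join(col1: pd.Series, col2: pd.Series) -> pd.Series:
    # each element of col1/col2 is an array of strings, e.g. ['abc', 'def', 'ghi']
    return pd.Series(
        [['_'.join(pair) for pair in zip(xs, ys)] for xs, ys in zip(col1, col2)]
    )

df = df.withColumn('col3', zip_join(F.split(df.col1, ','), F.split(df.col2, ',')))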