Replace elements in an array with their corresponding elements in PySpark

Question

I have this dataframe:

+-----+---------------------+
|Index|flagArray            |
+-----+---------------------+
|1    |[A, S, A, E, Z, S, S]|
|2    |[A, Z, Z, E, Z, S, S]|
+-----+---------------------+

I want to replace each array element with its corresponding numeric value:

A - 0
F - 1
S - 2
E - 3
Z - 4

So the output dataframe should look like this:

+-----+---------------------+---------------------+
|Index|flagArray            |finalArray           |
+-----+---------------------+---------------------+
|1    |[A, S, A, E, Z, S, S]|[0, 2, 0, 3, 4, 2, 2]|
|2    |[A, Z, Z, E, Z, S, S]|[0, 4, 4, 3, 4, 2, 2]|
+-----+---------------------+---------------------+

I have written a UDF in PySpark that does this with a chain of if/else statements. Is there a better way to handle this?

blackbishop · Accepted Answer

For Spark 2.4+, you can use the transform function to loop over each element of the flagArray array and look up its mapped value with element_at, using a map column built from the mapping:

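from pyspark.sql.functions import array, expr, lit, map_from_entries, struct

mappings = {"A": 0, "F": 1, "S": 2, "E": 3, "Z": 4}
# Build a literal map column from the Python dict
mapping_col = map_from_entries(array(*[struct(lit(k), lit(v)) for k, v in mappings.items()]))

# Add the map as a temporary column so the SQL expression can refer to it, then drop it
df = df.withColumn("mappings", mapping_col) \
       .withColumn("finalArray", expr("transform(flagArray, x -> element_at(mappings, x))")) \
       .drop("mappings")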

df.show(truncate=False)
#+-----+---------------------+---------------------+
#|Index|flagArray            |finalArray           |
#+-----+---------------------+---------------------+
#|1    |[A, S, A, E, Z, S, S]|[0, 2, 0, 3, 4, 2, 2]|
#|2    |[A, Z, Z, E, Z, S, S]|[0, 4, 4, 3, 4, 2, 2]|
#+-----+---------------------+---------------------+
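As a variation on the approach above, the map can be written inline in the SQL expression, which avoids the temporary mappings column. The sketch below only builds the expression string; build_transform_expr is a hypothetical helper name, and it assumes the keys are plain strings without quote characters and the values are integers.

```python
def build_transform_expr(array_col, mapping):
    """Return a Spark SQL expression that maps each array element via an inline map."""
    # Render the dict as map('A', 0, 'F', 1, ...).
    # Assumption: keys contain no quote characters and values are integers.
    entries = ", ".join(f"'{k}', {int(v)}" for k, v in mapping.items())
    return f"transform({array_col}, x -> element_at(map({entries}), x))"


mappings = {"A": 0, "F": 1, "S": 2, "E": 3, "Z": 4}
sql_expr = build_transform_expr("flagArray", mappings)
print(sql_expr)

# With a DataFrame at hand, the string would be passed to expr():
# df = df.withColumn("finalArray", expr(sql_expr))
```

Whether this reads better than the temporary-column version is a matter of taste; for a large mapping, the map column keeps the expression shorter.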

ernest_k · Answer

If your Spark version lacks a built-in function to map array elements (transform arrived in 2.4), a UDF is still an option. Here is an alternative, different from yours in that it uses a list comprehension:

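import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, IntegerType
dic = {'A': 0, 'F': 1, 'S': 2, 'E': 3, 'Z': 4}
# Declare the return type; without it the udf defaults to StringType
map_array = f.udf(lambda a: [dic[k] for k in a], ArrayType(IntegerType()))
df.withColumn('finalArray', map_array(df['flagArray'])).show(truncate=False)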

Output:

+------+---------------------+---------------------+
|Index |flagArray            |finalArray           |
+------+---------------------+---------------------+
|1     |[A, S, A, E, Z, S, S]|[0, 2, 0, 3, 4, 2, 2]|
|2     |[A, Z, Z, E, Z, S, S]|[0, 4, 4, 3, 4, 2, 2]|
+------+---------------------+---------------------+
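One caveat with the lambda above: it raises a KeyError for a flag that is not in the dictionary, and a TypeError if the array itself is null. A minimal sketch of a more forgiving lookup follows; map_flags is a hypothetical helper, shown here as plain Python so it can be tried without a Spark session.

```python
def map_flags(flags, mapping, default=None):
    """Map each flag to its numeric value, tolerating null arrays and unknown flags."""
    if flags is None:
        # A null array stays null
        return None
    # dict.get returns `default` instead of raising KeyError for unmapped flags
    return [mapping.get(flag, default) for flag in flags]


dic = {'A': 0, 'F': 1, 'S': 2, 'E': 3, 'Z': 4}
print(map_flags(['A', 'S', 'A', 'E', 'Z', 'S', 'S'], dic))  # [0, 2, 0, 3, 4, 2, 2]
print(map_flags(['A', 'X'], dic))                           # [0, None]
print(map_flags(None, dic))                                 # None

# It could then be wrapped as a UDF with an explicit return type, for example:
# map_array = f.udf(lambda a: map_flags(a, dic), ArrayType(IntegerType()))
```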

Tags:

arrays

replace

apache-spark

apache-spark-sql

pyspark

Saikat

