
pyspark map type contains duplicate keys

Could someone help me understand why the map type in PySpark can contain duplicate keys?

An example would be:

# generate a sample dataframe (assumes an active SparkSession named `spark`)
# the `field` column is an array of structs with fields a and b
# the goal is to create a map from a -> b

from pyspark.sql import Row

df = spark.createDataFrame([{
    'field': [Row(a=1, b=2), Row(a=1, b=3)],
}])


# above code would generate a dataframe like this
+----------------+
|           field|
+----------------+
|[[1, 2], [1, 3]]|
+----------------+

# with schema
root
 |-- field: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: long (nullable = true)
 |    |    |-- b: long (nullable = true)

Then I applied map_from_entries to this dataframe, trying to collect unique a -> b pairs. I expected the map to contain unique keys, that is, {1 -> 3} in this case. However, I get {1 -> 2, 1 -> 3} before collecting. This contradicts the usual idea of a map type.

import pyspark.sql.functions as F
df.select(F.map_from_entries("field")).show()

# the result is
+-----------------------+
|map_from_entries(field)|
+-----------------------+
|       [1 -> 2, 1 -> 3]|
+-----------------------+

I also tried applying F.map_keys() to this field and got [1, 1] as the result. However, when I collect the dataframe, the result has no duplicate keys:

df.select(F.map_from_entries("field")).collect()

# result
[Row(map_from_entries(field)={1: 3})]

This is causing unexpected behavior in my Spark job, and I would really appreciate it if someone could help me understand why PySpark behaves this way. Is this a bug or by design?

Ed Ding asked Sep 11 '25 09:09


1 Answer

This goes back to how maps are built in Scala. The documentation for toMap (https://www.scala-lang.org/api/2.12.2/scala/collection/immutable/List.html#toMap[T,U]:scala.collection.Map[T,U]) says:

"Duplicate keys will be overwritten by later keys: if this is an unordered collection, which key is in the resulting map is undefined"

Therefore the later entry 1 -> 3 overwrites the earlier entry 1 -> 2. This is the designed behaviour and not a bug.
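To make the overwrite rule concrete, here is a minimal plain-Python sketch (no Spark required). It rests on a simplifying assumption: that the uncollected map column keeps its entries as an ordered list of key/value pairs, and that duplicates only collapse when those pairs are converted into a native map. The names `entries`, `stored_keys`, and `collected` are made up for illustration and are not part of any Spark API.

```python
# Entries as they appear in the `field` column: (a, b) pairs.
entries = [(1, 2), (1, 3)]

# Simplified model of the uncollected map: keys are kept as-is,
# so a duplicate key appears twice (compare the [1, 1] from map_keys).
stored_keys = [key for key, _ in entries]

# Converting the pairs into a native dict keeps only the last value
# written for each key, which mirrors the {1: 3} returned by collect().
collected = dict(entries)

print(stored_keys)
print(collected)
```

Under this model, the difference between the two outputs in the question comes from where the conversion to a native map happens, not from the entries themselves.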

mck answered Sep 13 '25 00:09