My DataFrame has the following structure:
-------------------------
| Brand | type | amount |
-------------------------
| B     | a    | 10     |
| B     | b    | 20     |
| C     | c    | 30     |
-------------------------
I want to reduce the number of rows by grouping type and amount into a single column of type Map, so that Brand is unique and MAP_type_AMOUNT holds a key/value pair for each type/amount combination.
I think Spark SQL might have some functions to help with this, or do I have to get the RDD behind the DataFrame and write my "own" conversion to a map type?
Expected:
---------------------------
| Brand | MAP_type_AMOUNT |
---------------------------
| B     | {a: 10, b: 20}  |
| C     | {c: 30}         |
---------------------------
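To make the target shape concrete, here is a minimal plain-Python sketch (no Spark involved) of the regrouping described in the question: each Brand ends up with one dictionary mapping type to amount. The function name `group_to_maps` is hypothetical, and the rows are the sample data from the table above.

```python
from collections import defaultdict

def group_to_maps(rows):
    """Group (brand, type, amount) tuples into {brand: {type: amount}}."""
    grouped = defaultdict(dict)
    for brand, type_, amount in rows:
        # A later row with the same brand and type overwrites the earlier one.
        grouped[brand][type_] = amount
    return dict(grouped)

# Sample rows from the question's table
rows = [("B", "a", 10), ("B", "b", 20), ("C", "c", 30)]
result = group_to_maps(rows)
print(result)  # {'B': {'a': 10, 'b': 20}, 'C': {'c': 30}}
```

This only illustrates the intended result; it is not how Spark would perform the aggregation on a distributed DataFrame.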
In the Java/Scala API, a map column type can be created with the createMapType() method on the DataTypes class. This method takes two arguments, keyType and valueType, both of which must be of a type that extends DataType. The snippet this refers to creates a "mapCol" object of type MapType with String keys and String values.
In PySpark, use the MapType() constructor to create a map type object. MapType key points: the first parameter, keyType, specifies the type of the keys in the map; the second parameter, valueType, specifies the type of the values in the map.
Using the concat() function to concatenate DataFrame columns: Spark SQL functions provide concat() to concatenate two or more DataFrame columns into a single Column. It can also take columns of different data types and concatenate them into a single column; for example, it supports String, Int, Boolean and also arrays.
Slight improvement to Prem's answer (sorry, I can't comment yet): use func.create_map instead of func.struct. See the documentation.
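import pyspark.sql.functions as func
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Sample data matching the question's table
df = spark.createDataFrame(
    [('B', 'a', 10), ('B', 'b', 20), ('C', 'c', 30)],
    ['Brand', 'Type', 'Amount'])
# Build a one-entry map per row, then collect the maps for each Brand
df_converted = df.groupBy("Brand").agg(
    func.collect_list(
        func.create_map(func.col("Type"), func.col("Amount"))
    ).alias("MAP_type_AMOUNT"))
print(df_converted.collect())
Output (under Python 3; row order may vary):
[Row(Brand='B', MAP_type_AMOUNT=[{'a': 10}, {'b': 20}]),
 Row(Brand='C', MAP_type_AMOUNT=[{'c': 30}])]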
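Note that this output holds a list of one-entry maps per Brand, not the single map shown under Expected. A minimal sketch of the remaining merge step, in plain Python, might look like the following; `merge_maps` is a hypothetical helper name and is not part of PySpark.

```python
def merge_maps(maps):
    """Merge a list of dicts into one dict; later keys win on duplicates."""
    merged = {}
    for m in maps:
        merged.update(m)
    return merged

# The collected value for Brand 'B' from the output above
collected = [{"a": 10}, {"b": 20}]
merged_b = merge_maps(collected)
print(merged_b)  # {'a': 10, 'b': 20}
```

A function like this could presumably be wrapped with func.udf and a MapType return type to apply it to the MAP_type_AMOUNT column. On Spark 2.4 or later, func.map_from_entries applied to a collect_list of func.struct("Type", "Amount") should give a single map per Brand without a UDF, though that variant has not been run here.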