The data looks like this -
+---+-----+-----------------------------+
| id|point|                         data|
+---+-----+-----------------------------+
|abc|    6|{"key1":"124", "key2": "345"}|
|dfl|    7|{"key1":"777", "key2": "888"}|
|4bd|    6|{"key1":"111", "key2": "788"}|
+---+-----+-----------------------------+
I am trying to break it into the following format.
+---+-----+----+----+
| id|point|key1|key2|
+---+-----+----+----+
|abc|    6| 124| 345|
|dfl|    7| 777| 888|
|4bd|    6| 111| 788|
+---+-----+----+----+
The explode function expands a column into multiple rows, but that is not the desired solution here.
Note: This related question does not answer mine: PySpark "explode" dict in column
The explode() function operates on array and map columns, not JSON strings. From the docs:
Returns a new row for each element in the given array or map. Uses the default column name col for elements in the array and key and value for elements in the map unless specified otherwise.
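In other words, exploding a map yields one row per key/value pair rather than new columns, which is why it cannot produce the desired layout. A plain-Python sketch (illustrative data, not Spark) of that row expansion:

```python
# One input row whose "data" value is map-like (hypothetical, not a Spark Row)
row = ("abc", 6, {"key1": "124", "key2": "345"})

# explode-style expansion: one output row per map entry (key, value)
exploded = [(row[0], row[1], key, value) for key, value in row[2].items()]
print(exploded)  # [('abc', 6, 'key1', '124'), ('abc', 6, 'key2', '345')]
```

Each map entry becomes its own row, so a row with two keys turns into two rows instead of two extra columns.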
As long as you are using Spark 2.1 or higher, pyspark.sql.functions.from_json should get you your desired result, but you first need to define the required schema:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

# Schema matching the JSON strings in the "data" column
schema = StructType(
    [
        StructField('key1', StringType(), True),
        StructField('key2', StringType(), True)
    ]
)

df.withColumn("data", from_json("data", schema))\
    .select(col('id'), col('point'), col('data.*'))\
    .show()
which should give you
+---+-----+----+----+
| id|point|key1|key2|
+---+-----+----+----+
|abc|    6| 124| 345|
|dfl|    7| 777| 888|
|4bd|    6| 111| 788|
+---+-----+----+----+
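For reference, the same flattening can be sketched outside Spark with the standard json module, using the sample rows from the question (this only illustrates what from_json does per row; it is not a substitute for the Spark code above):

```python
import json

# Sample rows from the question: (id, point, data-as-JSON-string)
rows = [
    ("abc", 6, '{"key1":"124", "key2": "345"}'),
    ("dfl", 7, '{"key1":"777", "key2": "888"}'),
    ("4bd", 6, '{"key1":"111", "key2": "788"}'),
]

# Parse each JSON string and promote its keys to top-level columns
flat = [
    (id_, point, parsed["key1"], parsed["key2"])
    for id_, point, data in rows
    for parsed in (json.loads(data),)
]
print(flat[0])  # ('abc', 6, '124', '345')
```

Like the Spark version, the values stay strings because the JSON stores them as strings; cast them afterwards if you need numeric columns.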