PySpark explode stringified array of dictionaries into rows

Tags: python, dataframe, apache-spark, apache-spark-sql, pyspark

I have a PySpark DataFrame with a StringType column (edges) that holds a JSON-encoded list of dictionaries (see the example below). The dictionaries contain a mix of value types, including a nested dictionary (nodeIDs). I need to explode the top-level dictionaries in the edges field into rows, one row per dictionary; ideally, I should then be able to turn their component values, including nodeA and nodeB, into separate columns.

Input:

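import findspark
findspark.init()

# import after findspark.init() so that pyspark is on sys.path
from pyspark.sql import SparkSession, Row

SPARK = SparkSession.builder.enableHiveSupport() \
                    .getOrCreate()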

data = [
    Row(trace_uuid='aaaa', timestamp='2019-05-20T10:36:33+02:00', edges='[{"distance":4.382441320292239,"duration":1.5,"speed":2.9,"nodeIDs":{"nodeA":954752475,"nodeB":1665827480}},{"distance":14.48582171131768,"duration":2.6,"speed":5.6,"nodeIDs":{"nodeA":1665827480,"nodeB":3559056131}}]', count=156, level=36),
    Row(trace_uuid='bbbb', timestamp='2019-05-20T11:36:10+03:00', edges='[{"distance":0,"duration":0,"speed":0,"nodeIDs":{"nodeA":520686131,"nodeB":520686216}},{"distance":8.654358326561642,"duration":3.1,"speed":2.8,"nodeIDs":{"nodeA":520686216,"nodeB":506361795}}]', count=179, level=258)
    ]

df = SPARK.createDataFrame(data)

Desired output:

    data_reshaped = [
        Row(trace_uuid='aaaa', timestamp='2019-05-20T10:36:33+02:00', distance=4.382441320292239, duration=1.5, speed=2.9, nodeA=954752475, nodeB=1665827480, count=156, level=36),
        Row(trace_uuid='aaaa', timestamp='2019-05-20T10:36:33+02:00', distance=14.48582171131768, duration=2.6, speed=5.6, nodeA=1665827480, nodeB=3559056131, count=156, level=36),
        Row(trace_uuid='bbbb', timestamp='2019-05-20T11:36:10+03:00', distance=0, duration=0, speed=0, nodeA=520686131, nodeB=520686216, count=179, level=258),
        Row(trace_uuid='bbbb', timestamp='2019-05-20T11:36:10+03:00', distance=8.654358326561642, duration=3.1, speed=2.8, nodeA=520686216, nodeB=506361795, count=179, level=258)
    ]

Is there a way to do that? I've tried using cast to cast the edges field into an array first, but I can't figure out how to get it to work with the mixed data types.

I'm using Spark 2.4.0.


asked Jun 14 '19 00:06

SoHei

1 Answer

You can use from_json() to parse the string column, with schema_of_json() inferring the JSON schema from a sample value. For example:

from pyspark.sql import functions as F

# a sample json string:  
edges_json_sample = data[0].edges
# or edges_json_sample = df.select('edges').first()[0]

>>> edges_json_sample
#'[{"distance":4.382441320292239,"duration":1.5,"speed":2.9,"nodeIDs":{"nodeA":954752475,"nodeB":1665827480}},{"distance":14.48582171131768,"duration":2.6,"speed":5.6,"nodeIDs":{"nodeA":1665827480,"nodeB":3559056131}}]'

# infer schema from the sample string
schema = df.select(F.schema_of_json(edges_json_sample)).first()[0]

>>> schema
#u'array<struct<distance:double,duration:double,nodeIDs:struct<nodeA:bigint,nodeB:bigint>,speed:double>>'

# convert json string to data structure and then retrieve desired items
new_df = df.withColumn('data', F.explode(F.from_json('edges', schema))) \
           .select('*', 'data.*', 'data.nodeIDs.*') \
           .drop('data', 'nodeIDs', 'edges')
           
>>> new_df.show()
+-----+-----+--------------------+----------+-----------------+--------+-----+----------+----------+
|count|level|           timestamp|trace_uuid|         distance|duration|speed|     nodeA|     nodeB|
+-----+-----+--------------------+----------+-----------------+--------+-----+----------+----------+
|  156|   36|2019-05-20T10:36:...|      aaaa|4.382441320292239|     1.5|  2.9| 954752475|1665827480|
|  156|   36|2019-05-20T10:36:...|      aaaa|14.48582171131768|     2.6|  5.6|1665827480|3559056131|
|  179|  258|2019-05-20T11:36:...|      bbbb|              0.0|     0.0|  0.0| 520686131| 520686216|
|  179|  258|2019-05-20T11:36:...|      bbbb|8.654358326561642|     3.1|  2.8| 520686216| 506361795|
+-----+-----+--------------------+----------+-----------------+--------+-----+----------+----------+

# collect the rows to compare against the expected result
data_reshaped = new_df.collect()
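
As a cross-check on a small sample, the same reshaping can be sketched in plain Python with json.loads. This is only a reference for what the exploded rows should contain, not a replacement for from_json() on a real DataFrame. The helper name explode_edges and the dict-based sample record are assumptions made for illustration; they are not part of the code above.

```python
import json


def explode_edges(record):
    """Yield one flat dict per edge found in record['edges'] (a JSON string).

    Hypothetical helper for illustration only.
    """
    # Columns that are repeated unchanged on every exploded row.
    base = {k: v for k, v in record.items() if k != "edges"}
    for edge in json.loads(record["edges"]):
        row = dict(base)
        # Lift the nested nodeIDs dictionary to top-level keys.
        node_ids = edge.pop("nodeIDs", {})
        row.update(edge)      # distance, duration, speed
        row.update(node_ids)  # nodeA, nodeB
        yield row


# First record from the question, written as a plain dict.
sample = {
    "trace_uuid": "aaaa",
    "timestamp": "2019-05-20T10:36:33+02:00",
    "edges": '[{"distance":4.382441320292239,"duration":1.5,"speed":2.9,'
             '"nodeIDs":{"nodeA":954752475,"nodeB":1665827480}},'
             '{"distance":14.48582171131768,"duration":2.6,"speed":5.6,'
             '"nodeIDs":{"nodeA":1665827480,"nodeB":3559056131}}]',
    "count": 156,
    "level": 36,
}

rows = list(explode_edges(sample))
for row in rows:
    print(row)
```

One difference to keep in mind when comparing: json.loads keeps a literal 0 as a Python int, whereas the inferred Spark schema types distance, duration and speed as double, so the DataFrame shows 0.0 for those values.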


answered Oct 23 '22 01:10

jxc

