Pyspark: explode json in column to multiple columns

The data looks like this:

+-----------+-----------+-----------------------------+
|         id|      point|                         data|
+-----------+-----------+-----------------------------+
|        abc|          6|{"key1":"124", "key2": "345"}|
|        dfl|          7|{"key1":"777", "key2": "888"}|
|        4bd|          6|{"key1":"111", "key2": "788"}|
+-----------+-----------+-----------------------------+

I am trying to break it into the following format.

+-----------+-----------+-----------+-----------+
|         id|      point|       key1|       key2|
+-----------+-----------+-----------+-----------+
|        abc|          6|        124|        345|
|        dfl|          7|        777|        888|
|        4bd|          6|        111|        788|
+-----------+-----------+-----------+-----------+

The explode function splits the data into multiple rows, but that is not what I want. I need the JSON keys to become separate columns.

Note: This solution does not answer my question: PySpark "explode" dict in column

asked Jun 27 '18 19:06

sjishan

1 Answer

As long as you are using Spark version 2.1 or higher, pyspark.sql.functions.from_json should get you the desired result, but you first need to define the required schema:

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType(
    [
        StructField('key1', StringType(), True),
        StructField('key2', StringType(), True)
    ]
)

df.withColumn("data", from_json("data", schema))\
    .select(col('id'), col('point'), col('data.*'))\
    .show()

which should give you

+---+-----+----+----+
| id|point|key1|key2|
+---+-----+----+----+
|abc|    6| 124| 345|
|dfl|    7| 777| 888|
|4bd|    6| 111| 788|
+---+-----+----+----+

answered Oct 14 '22 19:10

Ramesh Maharjan

Tags: python, apache-spark, apache-spark-sql, pyspark
