PySpark - Add a new nested column or change the value of existing nested columns

Suppose I have a JSON file whose lines have the following structure:

{
 "a": 1,
 "b": {
       "bb1": 1,
       "bb2": 2
      }
}

I want to change the value of the key bb1 or add a new key such as bb3. Currently, I use spark.read.json to load the JSON file into Spark as a DataFrame and df.rdd.map to map each Row of the RDD to a dict. Then I change the nested key value or add a nested key, and convert the dict back to a Row. Finally, I convert the RDD back to a DataFrame. The workflow looks like this:

def map_func(row):
  dictionary = row.asDict(True)
  # ... add a new key or change the value of an existing key ...
  return as_row(dictionary)  # as_row converts a dict back to a Row recursively

df = spark.read.json("json_file")
df.rdd.map(map_func).toDF().write.json("new_json_file")
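
For reference, a minimal sketch of what a recursive dict-to-Row helper like as_row could look like (this helper is not shown in the question, so the implementation below is an assumption):

from pyspark.sql import Row

def as_row(obj):
  # Hypothetical sketch: recursively wrap dicts (and lists of dicts) in Row objects
  if isinstance(obj, dict):
    return Row(**{k: as_row(v) for k, v in obj.items()})
  if isinstance(obj, list):
    return [as_row(v) for v in obj]
  return obj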

This works for me, but I'm concerned that converting DataFrame -> RDD (Row -> dict -> Row) -> DataFrame will kill efficiency. Is there another way to meet this requirement without sacrificing efficiency?


The final solution I used relies on withColumn and dynamically building the schema of b. First, we can get b_schema from the DataFrame schema with:

b_schema = next(field['type'] for field in df.schema.jsonValue()['fields'] if field['name'] == 'b')
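
For the sample JSON above, b_schema would be roughly the following dict (the field types are whatever spark.read.json inferred, typically long for integers):

{
  "type": "struct",
  "fields": [
    {"metadata": {}, "name": "bb1", "nullable": True, "type": "long"},
    {"metadata": {}, "name": "bb2", "nullable": True, "type": "long"}
  ]
}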

b_schema is now a dict, so we can append a new field to it (typed as long here, since bb3 will hold the sum of bb1 and bb2):

b_schema['fields'].append({"metadata": {}, "type": "long", "name": "bb3", "nullable": True})

Then we can convert it back to a StructType:

new_b = StructType.fromJson(b_schema)

In map_func, we convert the Row to a dict and populate the new field:

def map_func(row):
  data = row.asDict(True)
  data['bb3'] = data['bb1'] + data['bb2']
  return data

map_udf = udf(map_func, new_b)
df.withColumn('b', map_udf('b')).collect()
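
Putting these steps together, a self-contained version of this approach might look like the following sketch (the file paths are the placeholders from the question, and bb3 is typed long because it holds the sum of two longs):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("json_file")

# Rebuild the schema of column 'b' with an extra field bb3
b_schema = next(field['type'] for field in df.schema.jsonValue()['fields'] if field['name'] == 'b')
b_schema['fields'].append({"metadata": {}, "type": "long", "name": "bb3", "nullable": True})
new_b = StructType.fromJson(b_schema)

def map_func(row):
  data = row.asDict(True)
  data['bb3'] = data['bb1'] + data['bb2']  # populate the new nested field
  return data

map_udf = udf(map_func, new_b)
df.withColumn('b', map_udf('b')).write.json("new_json_file")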

Thanks @Mariusz


1 Answer

You can use map_func as a UDF and thereby avoid converting DF -> RDD -> DF, while still having the flexibility of Python to implement business logic. All you need is to create a schema object:

>>> from pyspark.sql.types import *
>>> new_b = StructType([StructField('bb1', LongType()), StructField('bb2', LongType()), StructField('bb3', LongType())])

Then you define map_func and the UDF:

>>> from pyspark.sql.functions import *
>>> def map_func(data):
...     return {'bb1': 4, 'bb2': 5, 'bb3': 6}
... 
>>> map_udf = udf(map_func, new_b)

Finally, apply this UDF to the DataFrame:

>>> df = spark.read.json('sample.json')
>>> df.withColumn('b', map_udf('b')).first()
Row(a=1, b=Row(bb1=4, bb2=5, bb3=6))
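
If you want to preserve the incoming values of bb1 and bb2 rather than hard-coding them, the UDF can build its result from the nested Row it receives, for example (a sketch assuming the same sample.json, where this should yield bb3 = 3):

>>> def map_func(b):
...     d = b.asDict()
...     d['bb3'] = d['bb1'] + d['bb2']   # derive the new field from the existing ones
...     return d
... 
>>> map_udf = udf(map_func, new_b)
>>> df.withColumn('b', map_udf('b')).first()
Row(a=1, b=Row(bb1=1, bb2=2, bb3=3))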

EDIT:

According to the comment: you can add a field to an existing StructType in an easier way, for example:

>>> df = spark.read.json('sample.json')
>>> new_b = df.schema['b'].dataType.add(StructField('bb3', LongType()))
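
The resulting new_b can then be passed to udf exactly as above:

>>> map_udf = udf(map_func, new_b)
>>> df.withColumn('b', map_udf('b')).first()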