Rename nested field in spark dataframe

Tags: python, dataframe, rename, apache-spark, pyspark

I have a DataFrame df in Spark with this schema:

 |-- array_field: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: string (nullable = true)
 |    |    |-- b: long (nullable = true)
 |    |    |-- c: long (nullable = true)

How can I rename the field array_field.a to array_field.a_renamed?

[Update]:

.withColumnRenamed() does not work with nested fields, so I tried this hacky and unsafe method:

# First alter the schema:
schema = df.schema
schema['array_field'].dataType.elementType['a'].name = 'a_renamed'

ind = schema['array_field'].dataType.elementType.names.index('a')
schema['array_field'].dataType.elementType.names[ind] = 'a_renamed'

# Then set dataframe's schema with altered schema
df._schema = schema

I know that setting a private attribute is not good practice, but I don't know another way to set the schema for df.

I think I am on the right track, but df.printSchema() still shows the old name for array_field.a, even though df.schema == schema is True.

asked Mar 24 '17 16:03

MaxPY

1 Answer

Python

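It is not possible to modify a single nested field in place. You have to recreate the whole structure. In this particular case the simplest solution is to use cast: casting to a type with the same field order and field types, but different field names, renames the fields.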

First a bunch of imports:

from collections import namedtuple
from pyspark.sql.functions import col
from pyspark.sql.types import (
    ArrayType, LongType, StringType, StructField, StructType)

and example data:

Record = namedtuple("Record", ["a", "b", "c"])

df = sc.parallelize([([Record("foo", 1, 3)], )]).toDF(["array_field"])

Let's confirm that the schema is the same as in your case:

df.printSchema()

root
 |-- array_field: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: string (nullable = true)
 |    |    |-- b: long (nullable = true)
 |    |    |-- c: long (nullable = true)

You can define the new schema, for example, as a string:

str_schema = "array<struct<a_renamed:string,b:bigint,c:bigint>>"

df.select(col("array_field").cast(str_schema)).printSchema()

root
 |-- array_field: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a_renamed: string (nullable = true)
 |    |    |-- b: long (nullable = true)
 |    |    |-- c: long (nullable = true)

or a DataType:

struct_schema = ArrayType(StructType([
    StructField("a_renamed", StringType()),
    StructField("b", LongType()),
    StructField("c", LongType())
]))

df.select(col("array_field").cast(struct_schema)).printSchema()

root
 |-- array_field: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a_renamed: string (nullable = true)
 |    |    |-- b: long (nullable = true)
 |    |    |-- c: long (nullable = true)
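
If there are many fields, the string form of the schema can be generated instead of typed by hand. The following is only a sketch in plain Python: struct_array_ddl and the renames mapping are hypothetical names introduced for illustration, and the field list is copied by hand from the schema above. Field order and types are left untouched so that only the names change in the cast.

```python
def struct_array_ddl(fields):
    """Build an 'array<struct<...>>' type string from (name, type) pairs."""
    body = ",".join(f"{name}:{dtype}" for name, dtype in fields)
    return f"array<struct<{body}>>"


# Element fields in their original order, with Spark SQL type names.
fields = [("a", "string"), ("b", "bigint"), ("c", "bigint")]

# Hypothetical mapping of old name -> new name; unlisted fields keep their name.
renames = {"a": "a_renamed"}

str_schema = struct_array_ddl([(renames.get(n, n), t) for n, t in fields])
print(str_schema)  # array<struct<a_renamed:string,b:bigint,c:bigint>>
```

The resulting string can be passed to cast exactly like the hand-written str_schema above.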

Scala

The same techniques can be used in Scala:

// Outside spark-shell, `import spark.implicits._` is needed for toDF and $"..."
case class Record(a: String, b: Long, c: Long)

val df = Seq(Tuple1(Seq(Record("foo", 1, 3)))).toDF("array_field")

val strSchema = "array<struct<a_renamed:string,b:bigint,c:bigint>>"

df.select($"array_field".cast(strSchema))

or with a DataType:

import org.apache.spark.sql.types._

val structSchema = ArrayType(StructType(Seq(
    StructField("a_renamed", StringType),
    StructField("b", LongType),
    StructField("c", LongType)
)))

df.select($"array_field".cast(structSchema))

Possible improvements:

If you use an expressive data manipulation or JSON processing library, it can be easier to dump the data types to a dict or JSON string and work from there. For example, with Python and toolz:

from toolz.curried import pipe, update_in, map
from operator import attrgetter

# Update name to "a_updated" if name is "a"
rename_field = update_in(
    keys=["name"], func=lambda x: "a_updated" if x == "a" else x)

updated_schema = pipe(
   #  Get schema of the field as a dict
   df.schema["array_field"].jsonValue(),
   # Update fields with rename
   update_in(
       keys=["type", "elementType", "fields"],
       func=lambda x: pipe(x, map(rename_field), list)),
   # Load schema from dict
   StructField.fromJson,
   # Get data type
   attrgetter("dataType"))

df.select(col("array_field").cast(updated_schema)).printSchema()
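
If you would rather not add a dependency, the same idea can be sketched with plain dicts. This is only an illustration: rename_struct_field is a hypothetical helper, and field_json below is a hand-written stand-in for what df.schema["array_field"].jsonValue() returns for the example data, so the snippet runs without Spark.

```python
import copy


def rename_struct_field(field_json, old, new):
    """Return a copy of a StructField JSON dict (array of structs)
    with one element field renamed; order and types are unchanged."""
    out = copy.deepcopy(field_json)
    for f in out["type"]["elementType"]["fields"]:
        if f["name"] == old:
            f["name"] = new
    return out


# Hand-written stand-in for df.schema["array_field"].jsonValue()
field_json = {
    "name": "array_field",
    "nullable": True,
    "metadata": {},
    "type": {
        "type": "array",
        "containsNull": True,
        "elementType": {
            "type": "struct",
            "fields": [
                {"name": "a", "type": "string", "nullable": True, "metadata": {}},
                {"name": "b", "type": "long", "nullable": True, "metadata": {}},
                {"name": "c", "type": "long", "nullable": True, "metadata": {}},
            ],
        },
    },
}

renamed = rename_struct_field(field_json, "a", "a_renamed")
names = [f["name"] for f in renamed["type"]["elementType"]["fields"]]
print(names)  # ['a_renamed', 'b', 'c']

# With Spark available, the dict converts back to a DataType for the cast:
# updated_schema = StructField.fromJson(renamed).dataType
# df.select(col("array_field").cast(updated_schema)).printSchema()
```

The deep copy keeps the original dict untouched, which avoids the kind of in-place schema mutation attempted in the question.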

answered Oct 11 '22 12:10

zero323
