Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Renaming column names of a DataFrame in Spark Scala

I am trying to convert all the headers / column names of a DataFrame in Spark-Scala. as of now I come up with following code which only replaces a single column name.

for( i <- 0 to origCols.length - 1) {   df.withColumnRenamed(     df.columns(i),      df.columns(i).toLowerCase   ); } 
like image 302
Sam Avatar asked Feb 24 '16 03:02

Sam


People also ask

How would you rename the columns of a DataFrame?

Method 1: Using rename() function One way of renaming the columns in a Pandas Dataframe is by using the rename() function. This method is quite useful when we need to rename some selected columns because we need to specify information only for the columns which are to be renamed.

Can we rename the column values of data frame?

Renaming Columns of an Existing Dataframe To rename the columns of this DataFrame , we can use the rename() method which takes: A dictionary as the columns argument containing the mapping of original column names to the new column names as a key-value pairs.


1 Answers

If structure is flat:

val df = Seq((1L, "a", "foo", 3.0)).toDF df.printSchema // root //  |-- _1: long (nullable = false) //  |-- _2: string (nullable = true) //  |-- _3: string (nullable = true) //  |-- _4: double (nullable = false) 

the simplest thing you can do is to use toDF method:

val newNames = Seq("id", "x1", "x2", "x3") val dfRenamed = df.toDF(newNames: _*)  dfRenamed.printSchema // root // |-- id: long (nullable = false) // |-- x1: string (nullable = true) // |-- x2: string (nullable = true) // |-- x3: double (nullable = false) 

If you want to rename individual columns you can use either select with alias:

df.select($"_1".alias("x1")) 

which can be easily generalized to multiple columns:

val lookup = Map("_1" -> "foo", "_3" -> "bar")  df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*) 

or withColumnRenamed:

df.withColumnRenamed("_1", "x1") 

which use with foldLeft to rename multiple columns:

lookup.foldLeft(df)((acc, ca) => acc.withColumnRenamed(ca._1, ca._2)) 

With nested structures (structs) one possible option is renaming by selecting a whole structure:

val nested = spark.read.json(sc.parallelize(Seq(     """{"foobar": {"foo": {"bar": {"first": 1.0, "second": 2.0}}}, "id": 1}""" )))  nested.printSchema // root //  |-- foobar: struct (nullable = true) //  |    |-- foo: struct (nullable = true) //  |    |    |-- bar: struct (nullable = true) //  |    |    |    |-- first: double (nullable = true) //  |    |    |    |-- second: double (nullable = true) //  |-- id: long (nullable = true)  @transient val foobarRenamed = struct(   struct(     struct(       $"foobar.foo.bar.first".as("x"), $"foobar.foo.bar.first".as("y")     ).alias("point")   ).alias("location") ).alias("record")  nested.select(foobarRenamed, $"id").printSchema // root //  |-- record: struct (nullable = false) //  |    |-- location: struct (nullable = false) //  |    |    |-- point: struct (nullable = false) //  |    |    |    |-- x: double (nullable = true) //  |    |    |    |-- y: double (nullable = true) //  |-- id: long (nullable = true) 

Note that it may affect nullability metadata. Another possibility is to rename by casting:

nested.select($"foobar".cast(   "struct<location:struct<point:struct<x:double,y:double>>>" ).alias("record")).printSchema  // root //  |-- record: struct (nullable = true) //  |    |-- location: struct (nullable = true) //  |    |    |-- point: struct (nullable = true) //  |    |    |    |-- x: double (nullable = true) //  |    |    |    |-- y: double (nullable = true) 

or:

import org.apache.spark.sql.types._  nested.select($"foobar".cast(   StructType(Seq(     StructField("location", StructType(Seq(       StructField("point", StructType(Seq(         StructField("x", DoubleType), StructField("y", DoubleType))))))))) ).alias("record")).printSchema  // root //  |-- record: struct (nullable = true) //  |    |-- location: struct (nullable = true) //  |    |    |-- point: struct (nullable = true) //  |    |    |    |-- x: double (nullable = true) //  |    |    |    |-- y: double (nullable = true) 
like image 161
zero323 Avatar answered Sep 16 '22 16:09

zero323