
Spark 1.6: apply a function to a column with a dot in its name / how to properly escape colName

To apply a function to a column in Spark, the common way (only way?) seems to be

df.withColumn(colName, myUdf(df.col(colName)))
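
For example, a minimal sketch of that pattern (hypothetical column name and UDF, with df as a var as in the rest of this question):

import org.apache.spark.sql.functions.udf

// hypothetical UDF: just doubles a long value
val myUdf = udf((v: Long) => v * 2)

// works fine for a plain, dot-free column name
df = df.withColumn("hourOfDay", myUdf(df.col("hourOfDay")))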

Fine, but I have columns with dots in their names, and to access such a column I need to escape the name with backticks (`).

The problem is that if I use the escaped name, .withColumn creates a new column with the escaped name:

df.printSchema
root
 |-- raw.hourOfDay: long (nullable = false)
 |-- raw.minOfDay: long (nullable = false)
 |-- raw.dayOfWeek: long (nullable = false)
 |-- raw.sensor2: long (nullable = false)

df = df.withColumn("raw.hourOfDay", df.col("raw.hourOfDay"))
org.apache.spark.sql.AnalysisException: Cannot resolve column name "raw.hourOfDay" among (raw.hourOfDay, raw.minOfDay, raw.dayOfWeek, raw.sensor2);

this works:

df = df.withColumn("`raw.hourOfDay`", df.col("`raw.hourOfDay`"))
df: org.apache.spark.sql.DataFrame = [raw.hourOfDay: bigint, raw.minOfDay: bigint, raw.dayOfWeek: bigint, raw.sensor2: bigint, `raw.hourOfDay`: bigint]

scala> df.printSchema
root
 |-- raw.hourOfDay: long (nullable = false)
 |-- raw.minOfDay: long (nullable = false)
 |-- raw.dayOfWeek: long (nullable = false)
 |-- raw.sensor2: long (nullable = false)
 |-- `raw.hourOfDay`: long (nullable = false)

but, as you can see, the schema now contains an extra column with the escaped name.

If I do the above and attempt to drop the old column using the escaped name, it will drop the old column, but after that any attempt to access the new column results in something like:

org.apache.spark.sql.AnalysisException: Cannot resolve column name "`raw.sensor2`" among (`raw.hourOfDay`, `raw.minOfDay`, `raw.dayOfWeek`, `raw.sensor2`);

as if it now treats the backticks as part of the name rather than as escape characters.

So how do I 'replace' my old column with withColumn without changing the name?

(PS: note that my column names are parametric, so I loop over the names. I used specific string names here for clarity: the escape sequence really looks like "`" + colName + "`", as in the sketch below.)
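
Concretely, the escaping in my loop looks roughly like this (a small sketch; escaped is just a hypothetical helper of mine, not a Spark function):

// wrap a name in backticks only when it contains a dot
def escaped(colName: String): String =
  if (colName.contains(".")) "`" + colName + "`" else colName

// e.g. escaped("raw.hourOfDay") == "`raw.hourOfDay`"
// df = df.withColumn(escaped(t), myUdf(df.col(escaped(t))))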

EDIT:

Right now the only trick I've found is to do:

// df is a var here, reassigned on each iteration
for (t <- df.columns) {
  if (t.contains(".")) {
    df = df.withColumn("`" + t + "`", myUdf(df.col("`" + t + "`")))
    df = df.drop(df.col("`" + t + "`"))
    df = df.withColumnRenamed("`" + t + "`", t)
  } else {
    df = df.withColumn(t, myUdf(df.col(t)))
  }
}

Not really efficient, I guess...
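
The same trick can also be written without mutating df, e.g. with a foldLeft over the column names (just a sketch of the identical add/drop/rename steps, no more efficient):

val result = df.columns.foldLeft(df) { (acc, t) =>
  if (t.contains(".")) {
    // same three steps as above: add escaped copy, drop the original, rename back
    val added   = acc.withColumn("`" + t + "`", myUdf(acc.col("`" + t + "`")))
    val dropped = added.drop(added.col("`" + t + "`"))
    dropped.withColumnRenamed("`" + t + "`", t)
  } else {
    acc.withColumn(t, myUdf(acc.col(t)))
  }
}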

EDIT:

The documentation states:

def withColumn(colName: String, col: Column): DataFrame
Returns a new DataFrame by adding a column 
or replacing the existing column that has the same name.

So replacing a column should not be a problem. However, as pointed out by @Glennie below, using a new name works fine, so this may be a bug in Spark 1.6.
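
A quick way to check that documented behaviour is with a dot-free name, where the replacement does happen in place (a sketch, assuming a sqlContext is in scope as in the 1.6 shell):

var simple = sqlContext.createDataFrame(Seq((1L, 2L))).toDF("hourOfDay", "minOfDay")
simple = simple.withColumn("hourOfDay", simple.col("hourOfDay") + 1)
simple.printSchema  // still just hourOfDay and minOfDay: the column was replaced, not duplicated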

asked Mar 14 '16 by MrE


1 Answer

Thanks for the trick.

df = df.withColumn("`" + t + "`", myUdf(df.col("`" + t + "`")))
df = df.drop(df.col("`" + t + "`"))
df = df.withColumnRenamed("`" + t + "`", t)

It is working fine for me. Looking forward to seeing a better solution. Just a reminder that we will have similar issues with the '#' character too.
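
For the '#' case (assuming such names misresolve the same way, which I haven't verified), the identical backtick trick should apply, since backticks can quote any column name. A sketch with a hypothetical column "raw#sensor2":

df = df.withColumn("`raw#sensor2`", myUdf(df.col("`raw#sensor2`")))
df = df.drop(df.col("`raw#sensor2`"))
df = df.withColumnRenamed("`raw#sensor2`", "raw#sensor2")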

answered Oct 04 '22 by Pavan Challa