
Spark 1.6: apply a function to a column with a dot in its name / how to properly escape colName

To apply a function to a column in Spark, the common way (only way?) seems to be

df.withColumn(colName, myUdf(df.col(colName)))
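
For example, a minimal sketch of that pattern (hypothetical column name and UDF, with df as a var as in the rest of this question):

import org.apache.spark.sql.functions.udf

// hypothetical UDF: just doubles a long value
val myUdf = udf((v: Long) => v * 2)

// works fine for a plain, dot-free column name
df = df.withColumn("hourOfDay", myUdf(df.col("hourOfDay")))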

Fine, but I have columns with dots in their names, and to access such a column I need to escape the name with backticks (`).

The problem is that if I use the escaped name, .withColumn creates a new column with the escaped name:

df.printSchema
root
 |-- raw.hourOfDay: long (nullable = false)
 |-- raw.minOfDay: long (nullable = false)
 |-- raw.dayOfWeek: long (nullable = false)
 |-- raw.sensor2: long (nullable = false)

df = df.withColumn("raw.hourOfDay", df.col("raw.hourOfDay"))
org.apache.spark.sql.AnalysisException: Cannot resolve column name "raw.hourOfDay" among (raw.hourOfDay, raw.minOfDay, raw.dayOfWeek, raw.sensor2);

this works:

df = df.withColumn("`raw.hourOfDay`", df.col("`raw.hourOfDay`"))
df: org.apache.spark.sql.DataFrame = [raw.hourOfDay: bigint, raw.minOfDay: bigint, raw.dayOfWeek: bigint, raw.sensor2: bigint, `raw.hourOfDay`: bigint]

scala> df.printSchema
root
 |-- raw.hourOfDay: long (nullable = false)
 |-- raw.minOfDay: long (nullable = false)
 |-- raw.dayOfWeek: long (nullable = false)
 |-- raw.sensor2: long (nullable = false)
 |-- `raw.hourOfDay`: long (nullable = false)

but, as you can see, the schema now contains an extra column with the escaped name.

If I do the above and attempt to drop the old column using the escaped name, it will drop the old column, but after that any attempt to access the new column results in something like:

org.apache.spark.sql.AnalysisException: Cannot resolve column name "`raw.sensor2`" among (`raw.hourOfDay`, `raw.minOfDay`, `raw.dayOfWeek`, `raw.sensor2`);

as if it now treats the backticks as part of the name rather than as escape characters.

So how do I 'replace' my old column with withColumn without changing the name?

(PS: note that my column names are parametric, so I loop over the names. I used specific string names here for clarity: the escape sequence really looks like "`" + colName + "`", as in the sketch below.)
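
Concretely, the escaping in my loop looks roughly like this (a small sketch; escaped is just a hypothetical helper of mine, not a Spark function):

// wrap a name in backticks only when it contains a dot
def escaped(colName: String): String =
  if (colName.contains(".")) "`" + colName + "`" else colName

// e.g. escaped("raw.hourOfDay") == "`raw.hourOfDay`"
// df = df.withColumn(escaped(t), myUdf(df.col(escaped(t))))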

EDIT:

Right now the only trick I've found is to do:

// df is a var here, reassigned on each iteration
for (t <- df.columns) {
  if (t.contains(".")) {
    df = df.withColumn("`" + t + "`", myUdf(df.col("`" + t + "`")))
    df = df.drop(df.col("`" + t + "`"))
    df = df.withColumnRenamed("`" + t + "`", t)
  } else {
    df = df.withColumn(t, myUdf(df.col(t)))
  }
}

Not really efficient, I guess...
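
The same trick can also be written without mutating df, e.g. with a foldLeft over the column names (just a sketch of the identical add/drop/rename steps, no more efficient):

val result = df.columns.foldLeft(df) { (acc, t) =>
  if (t.contains(".")) {
    // same three steps as above: add escaped copy, drop the original, rename back
    val added   = acc.withColumn("`" + t + "`", myUdf(acc.col("`" + t + "`")))
    val dropped = added.drop(added.col("`" + t + "`"))
    dropped.withColumnRenamed("`" + t + "`", t)
  } else {
    acc.withColumn(t, myUdf(acc.col(t)))
  }
}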

EDIT:

The documentation states:

def withColumn(colName: String, col: Column): DataFrame
Returns a new DataFrame by adding a column 
or replacing the existing column that has the same name.

So replacing a column should not be a problem. However, as pointed out by @Glennie below, using a new name works fine, so this may be a bug in Spark 1.6.
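
A quick way to check that documented behaviour is with a dot-free name, where the replacement does happen in place (a sketch, assuming a sqlContext is in scope as in the 1.6 shell):

var simple = sqlContext.createDataFrame(Seq((1L, 2L))).toDF("hourOfDay", "minOfDay")
simple = simple.withColumn("hourOfDay", simple.col("hourOfDay") + 1)
simple.printSchema  // still just hourOfDay and minOfDay: the column was replaced, not duplicated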

asked Mar 14 '16 by MrE


1 Answer

Thanks for the trick.

df = df.withColumn("`" + t + "`", myUdf(df.col("`" + t + "`")))
df = df.drop(df.col("`" + t + "`"))
df = df.withColumnRenamed("`" + t + "`", t)

It is working fine for me. Looking forward to seeing a better solution. Just a reminder that we will have similar issues with the '#' character too.
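
For the '#' case (assuming such names misresolve the same way, which I haven't verified), the identical backtick trick should apply, since backticks can quote any column name. A sketch with a hypothetical column "raw#sensor2":

df = df.withColumn("`raw#sensor2`", myUdf(df.col("`raw#sensor2`")))
df = df.drop(df.col("`raw#sensor2`"))
df = df.withColumnRenamed("`raw#sensor2`", "raw#sensor2")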

answered Oct 04 '22 by Pavan Challa