To apply a function to a column in Spark, the common way (only way?) seems to be
df.withColumn(colName, myUdf(df.col(colName)))
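(Here myUdf stands for any column-level UDF; a minimal, purely illustrative definition could look like this:)
import org.apache.spark.sql.functions.udf
// hypothetical stand-in for myUdf: any one-argument UDF is applied the same way
val myUdf = udf((x: Long) => x + 1)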
That's fine, but I have columns with dots in their names, and to access such a column I need to escape the name with a backtick (`).
The problem is that if I use the escaped name, .withColumn creates a new column with the escaped name instead of replacing the original:
df.printSchema
root
|-- raw.hourOfDay: long (nullable = false)
|-- raw.minOfDay: long (nullable = false)
|-- raw.dayOfWeek: long (nullable = false)
|-- raw.sensor2: long (nullable = false)
df = df.withColumn("raw.hourOfDay", df.col("raw.hourOfDay"))
org.apache.spark.sql.AnalysisException: Cannot resolve column name "raw.hourOfDay" among (raw.hourOfDay, raw.minOfDay, raw.dayOfWeek, raw.sensor2);
This works:
df = df.withColumn("`raw.hourOfDay`", df.col("`raw.hourOfDay`"))
df: org.apache.spark.sql.DataFrame = [raw.hourOfDay: bigint, raw.minOfDay: bigint, raw.dayOfWeek: bigint, raw.sensor2: bigint, `raw.hourOfDay`: bigint]
scala> df.printSchema
root
|-- raw.hourOfDay: long (nullable = false)
|-- raw.minOfDay: long (nullable = false)
|-- raw.dayOfWeek: long (nullable = false)
|-- raw.sensor2: long (nullable = false)
|-- `raw.hourOfDay`: long (nullable = false)
But as you can see, the schema now contains a new column with the escaped name.
If I do the above and then attempt to drop the old column using the escaped name, the old column is indeed dropped, but after that any attempt to access the new column results in something like:
org.apache.spark.sql.AnalysisException: Cannot resolve column name "`raw.sensor2`" among (`raw.hourOfDay`, `raw.minOfDay`, `raw.dayOfWeek`, `raw.sensor2`);
as if the backtick is now treated as part of the name rather than as an escape character.
So how do I 'replace' my old column with withColumn, without changing the name?
(PS: note that my column names are parametric, so I loop over the names. I used literal names here for clarity; the escaped name is really built as "`" + colName + "`".)
EDIT:
Right now the only trick I have found is to do:
for (t <- df.columns) {
  if (t.contains(".")) {
    // add a new column whose name literally contains the backticks
    df = df.withColumn("`" + t + "`", myUdf(df.col("`" + t + "`")))
    // drop the original dotted column (here the backticks act as escapes)
    df = df.drop(df.col("`" + t + "`"))
    // rename the backtick-named column back to the plain dotted name
    df = df.withColumnRenamed("`" + t + "`", t)
  } else {
    df = df.withColumn(t, myUdf(df.col(t)))
  }
}
Not really efficient, I guess...
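A single select might avoid the whole add/drop/rename loop. Untested sketch, assuming the same myUdf as above and that every column should be transformed:
import org.apache.spark.sql.functions.col
val transformed = df.select(df.columns.map { t =>
  // backticks escape the dots when resolving; .as(t) keeps the plain dotted name
  myUdf(col("`" + t + "`")).as(t)
}: _*)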
EDIT:
The documentation states:
def withColumn(colName: String, col: Column): DataFrame
Returns a new DataFrame by adding a column
or replacing the existing column that has the same name.
So replacing a column should not be a problem. However, as pointed out by @Glennie below, using a new name works fine, so this may be a bug in Spark 1.6.
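For comparison, replacing a column with a dot-free name does behave as documented; a quick sketch with a throwaway DataFrame:
val plain = sqlContext.range(3)                            // single "id" column
val replaced = plain.withColumn("id", plain.col("id") + 1) // still a single "id" column, values replaced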
Thanks for the trick.
df = df.withColumn("`" + t + "`", myUdf(df.col("`" + t + "`")))
df = df.drop(df.col("`" + t + "`"))
df = df.withColumnRenamed("`" + t + "`", t)
It is working fine for me. Looking forward to seeing a better solution. Just a reminder that we will have similar issues with the '#' character too.