I'm using Spark 2.2 and I'd like to know which option performs best for removing a suffix from a column of a DataFrame.
Using a UDF:
import org.apache.spark.sql.functions.udf

val removeSuffix = udf { (id: String) =>
  if (id != null && id.endsWith("XXX")) {
    id.dropRight(3)
  } else {
    id
  }
}

df.withColumn("c", removeSuffix($"col"))
Or using the regexp_replace built-in function:
df.withColumn("c", regexp_replace($"col", "XXX$", ""))
I know that UDFs are known to be slow, but is it faster to evaluate a regexp for each row?
[Update 2018-01-21, based on the answer by user8983815]
I've written a benchmark and the results are a bit surprising:
[info] Benchmark                                       Mode  Cnt    Score   Error  Units
[info] RemoveSuffixBenchmark.builtin_optimized        avgt    10  103,188 ± 3,526  ms/op
[info] RemoveSuffixBenchmark.builtin_regexp_replace_  avgt    10   99,173 ± 7,313  ms/op
[info] RemoveSuffixBenchmark.udf                      avgt    10   94,570 ± 5,707  ms/op
For those who are interested, the code is here: https://github.com/YannMoisan/spark-jmh
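For reference, here is a minimal sketch of the shape such a JMH benchmark can take. This is not the exact code from the repo above: the row count, method names, and the sum(length(...)) aggregation used to force Spark to actually evaluate the column are illustrative assumptions.

import java.util.concurrent.TimeUnit

import org.apache.spark.sql.{Column, DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions._
import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.MILLISECONDS)
class RemoveSuffixBenchmark {
  var spark: SparkSession = _
  var df: DataFrame = _

  // The UDF variant from the question.
  val removeSuffixUdf = udf { (id: String) =>
    if (id != null && id.endsWith("XXX")) id.dropRight(3) else id
  }

  // The built-in Column variant from the answer.
  def removeSuffixBuiltin(c: Column): Column =
    when(c.endsWith("XXX"), c.substr(lit(0), length(c) - 3)).otherwise(c)

  @Setup
  def setup(): Unit = {
    spark = SparkSession.builder().master("local[1]").appName("bench").getOrCreate()
    import spark.implicits._
    // Half of the ids carry the "XXX" suffix; cache and materialize so
    // input generation is not part of the measurement.
    df = spark.range(1000000L)
      .map(i => if (i % 2 == 0) s"id${i}XXX" else s"id$i")
      .toDF("col")
      .cache()
    df.count()
  }

  @TearDown
  def tearDown(): Unit = spark.stop()

  // Aggregating over the result forces the expression to be computed for
  // every row; a bare count() could let the optimizer prune the projection.
  @Benchmark
  def udfVariant(): Array[Row] =
    df.agg(sum(length(removeSuffixUdf(col("col"))))).collect()

  @Benchmark
  def regexpReplaceVariant(): Array[Row] =
    df.agg(sum(length(regexp_replace(col("col"), "XXX$", "")))).collect()

  @Benchmark
  def builtinOptimizedVariant(): Array[Row] =
    df.agg(sum(length(removeSuffixBuiltin(col("col"))))).collect()
}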
I doubt regexp_replace will cause a serious performance issue, but if you are really concerned, you can stay with built-in Column expressions:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column

// substr is 1-based; a start position of 0 behaves the same as 1 here
def removeSuffix(c: Column) = when(c.endsWith("XXX"), c.substr(lit(0), length(c) - 3)).otherwise(c)
Used as:
scala> Seq("fooXXX", "bar").toDF("s").select(removeSuffix($"s").alias("s")).show
+---+
| s|
+---+
|foo|
|bar|
+---+
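And wired into the question's DataFrame:

df.withColumn("c", removeSuffix($"col"))

Note that a null column propagates through unchanged: endsWith on a null yields null, so the when branch does not fire and otherwise(c) returns the null, matching the null check in the UDF version.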