I'm using Spark 2.2 and I'd like to know which option performs best for removing a suffix from a column of a DataFrame.
Using a UDF:
import org.apache.spark.sql.functions.udf

val removeSuffix = udf { (id: String) =>
  if (id != null && id.endsWith("XXX")) {
    id.dropRight(3)
  } else {
    id
  }
}

df.withColumn("c", removeSuffix($"col"))
Or using the regexp_replace built-in function:
df.withColumn("c", regexp_replace($"col", "XXX$", ""))
I know that UDFs are known to be slow, but is it faster to evaluate a regexp for each row?
[Update 2018-01-21, based on the answer by user8983815]
I've written a benchmark and the results are a bit surprising:
[info] Benchmark                                       Mode  Cnt    Score   Error  Units
[info] RemoveSuffixBenchmark.builtin_optimized        avgt    10  103,188 ± 3,526  ms/op
[info] RemoveSuffixBenchmark.builtin_regexp_replace_  avgt    10   99,173 ± 7,313  ms/op
[info] RemoveSuffixBenchmark.udf                      avgt    10   94,570 ± 5,707  ms/op
For those who are interested, the code is here: https://github.com/YannMoisan/spark-jmh
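For reference, here is a minimal sketch of the shape such a JMH benchmark can take. This is not the exact code from the repo above: the row count, method names, and the sum(length(...)) aggregation used to force Spark to actually evaluate the column are illustrative assumptions.

import java.util.concurrent.TimeUnit

import org.apache.spark.sql.{Column, DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions._
import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.MILLISECONDS)
class RemoveSuffixBenchmark {
  var spark: SparkSession = _
  var df: DataFrame = _

  // The UDF variant from the question.
  val removeSuffixUdf = udf { (id: String) =>
    if (id != null && id.endsWith("XXX")) id.dropRight(3) else id
  }

  // The built-in Column variant from the answer.
  def removeSuffixBuiltin(c: Column): Column =
    when(c.endsWith("XXX"), c.substr(lit(0), length(c) - 3)).otherwise(c)

  @Setup
  def setup(): Unit = {
    spark = SparkSession.builder().master("local[1]").appName("bench").getOrCreate()
    import spark.implicits._
    // Half of the ids carry the "XXX" suffix; cache and materialize so
    // input generation is not part of the measurement.
    df = spark.range(1000000L)
      .map(i => if (i % 2 == 0) s"id${i}XXX" else s"id$i")
      .toDF("col")
      .cache()
    df.count()
  }

  @TearDown
  def tearDown(): Unit = spark.stop()

  // Aggregating over the result forces the expression to be computed for
  // every row; a bare count() could let the optimizer prune the projection.
  @Benchmark
  def udfVariant(): Array[Row] =
    df.agg(sum(length(removeSuffixUdf(col("col"))))).collect()

  @Benchmark
  def regexpReplaceVariant(): Array[Row] =
    df.agg(sum(length(regexp_replace(col("col"), "XXX$", "")))).collect()

  @Benchmark
  def builtinOptimizedVariant(): Array[Row] =
    df.agg(sum(length(removeSuffixBuiltin(col("col"))))).collect()
}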
I doubt regexp_replace will cause a serious performance issue, but if you are really concerned, you can stay with built-in Column expressions:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column

// substr is 1-based; a start position of 0 behaves the same as 1 here
def removeSuffix(c: Column) = when(c.endsWith("XXX"), c.substr(lit(0), length(c) - 3)).otherwise(c)
Used as:
scala> Seq("fooXXX", "bar").toDF("s").select(removeSuffix($"s").alias("s")).show
+---+
| s|
+---+
|foo|
|bar|
+---+
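And wired into the question's DataFrame:

df.withColumn("c", removeSuffix($"col"))

Note that a null column propagates through unchanged: endsWith on a null yields null, so the when branch does not fire and otherwise(c) returns the null, matching the null check in the UDF version.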