Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove a suffix if present on a string column of a DataFrame

I'm using Spark 2.2. and I'd like to know what is the best option in term of performance to remove a suffix on a column of a DataFrame

Using an udf

val removeSuffix = udf { (id: String) =>
    if (id != null && id.endsWith("XXX")) {
      id.dropRight(3)
    } else {
      id
    }
  }
df.withColumn("c", udf("col"))

Or using the regexp built-in function

df.withColumn("c", regexp_replace($"col", "XXX$", ""))

I know that udfs are known to be slow but is it faster to evaluate a regexp for each row ?

[2018-01-21 update based on answer by user8983815]

I've written a benchmarch and the results are a bit surprising

[info] Benchmark                                      Mode  Cnt    Score   Error  Units
[info] RemoveSuffixBenchmark.builtin_optimized        avgt   10  103,188 ± 3,526  ms/op
[info] RemoveSuffixBenchmark.builtin_regexp_replace_  avgt   10   99,173 ± 7,313  ms/op
[info] RemoveSuffixBenchmark.udf                      avgt   10   94,570 ± 5,707  ms/op

For those who are interested, the code is here : https://github.com/YannMoisan/spark-jmh

like image 531
Yann Moisan Avatar asked Jan 29 '26 12:01

Yann Moisan


1 Answers

I doubt regexp_replace will cause serious performance issue, but if you really concerned

import org.apache.spark.sql.functions._ 
import org.apache.spark.sql.Column

def removeSuffix(c: Column) = when(c.endsWith("XXX"), c.substr(lit(0), length(c) - 3)).otherwise(c)

Used as:

scala> Seq("fooXXX", "bar").toDF("s").select(removeSuffix($"s").alias("s")).show
+---+
|  s|
+---+
|foo|
|bar|
+---+
like image 180
user8983815 Avatar answered Jan 31 '26 06:01

user8983815



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!