Apply same function to all fields of spark dataframe row

1 Answer

I needed to do something similar, and ended up writing my own function to convert empty (or whitespace-only) strings in a DataFrame to null. This is what I did.

import org.apache.spark.sql.functions.{col, udf} 
import spark.implicits._ 

// Returns None for null, empty, or whitespace-only input; Spark stores None as null.
def emptyToNull(_str: String): Option[String] = {
  _str match {
    case s if s == null || s.trim.isEmpty => None
    case s => Some(s)
  }
}
val emptyToNullUdf = udf(emptyToNull(_: String))

val df = Seq(("a", "B", "c"), ("D", "e ", ""), ("", "", null)).toDF("x", "y", "z")
// Apply the UDF to every column, keeping the original column names.
df.select(df.columns.map(c => emptyToNullUdf(col(c)).alias(c)): _*).show()

+----+----+----+
|   x|   y|   z|
+----+----+----+
|   a|   B|   c|
|   D|  e |null|
|null|null|null|
+----+----+----+
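
For this particular cleanup a UDF is not strictly required. Below is a minimal sketch that does the same thing with built-in column functions only, which is a different technique from the UDF above. It assumes a local session or spark-shell, and `blankToNull` is a hypothetical helper name, not part of Spark.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, lit, trim, when}

// Assumption: local mode; in spark-shell, getOrCreate() returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("blank-to-null").getOrCreate()
import spark.implicits._

// Hypothetical helper: replace empty or whitespace-only strings with null.
// A null input stays null: the comparison is not true, so otherwise(c) returns c.
def blankToNull(c: Column): Column =
  when(trim(c) === "", lit(null).cast("string")).otherwise(c)

val df = Seq(("a", "B", "c"), ("D", "e ", ""), ("", "", null)).toDF("x", "y", "z")

// Apply the same expression to every column, keeping the original column names.
val cleaned = df.select(df.columns.map(c => blankToNull(col(c)).alias(c)): _*)
cleaned.show()
```

Because the expression stays inside Spark SQL, the optimizer can see through it, which is usually preferable to a UDF when a built-in equivalent exists.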

Here's a more refined version of emptyToNull that wraps the input in Option instead of checking for null explicitly.

def emptyToNull(_str: String): Option[String] = Option(_str) match {
  case ret @ Some(s) if (s.trim.nonEmpty) => ret
  case _ => None
}
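
Since the function is plain Scala, it can be checked without Spark. A small sketch, repeating the definition so it runs on its own in a Scala REPL; the sample inputs are made up for illustration:

```scala
// Same definition as above, repeated so the snippet is self-contained.
def emptyToNull(_str: String): Option[String] = Option(_str) match {
  case ret @ Some(s) if s.trim.nonEmpty => ret
  case _ => None
}

// Illustrative inputs: normal, trailing space, empty, whitespace-only, null.
val inputs = Seq("a", "e ", "", "   ", null)
val results = inputs.map(emptyToNull)

// Non-blank values are kept unchanged (not trimmed); everything else becomes None.
assert(results == Seq(Some("a"), Some("e "), None, None, None))
```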


answered Dec 06 '22 00:12

Tony Fraser


Tags:

apache-spark

apache-spark-sql

user2230605
