
Apply same function to all fields of spark dataframe row

I have a dataframe with roughly 1000 columns (the exact number varies).

I want to make all values upper case.

Here is the approach I have thought of (sketched below); can you suggest whether this is the best way?

  • Take a row.
  • Find the schema, store it in an array, and count how many fields there are.
  • Map over each row in the dataframe, up to the number of fields in the array.
  • Apply a function to uppercase each field and return the row.
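
Roughly, what I had in mind looks like this sketch (Scala; df is my existing dataframe, and I'm assuming a SparkSession named spark and that every column is a string):

import org.apache.spark.sql.Row

// keep the schema around so I know how many fields each row has
val fieldCount = df.schema.fields.length

// map over every row and uppercase each field, returning a new row
val upperRdd = df.rdd.map { row =>
  Row.fromSeq((0 until fieldCount).map { i =>
    val v = row.get(i)
    if (v == null) null else v.toString.toUpperCase
  })
}

// rebuild a dataframe with the original schema
val upperDf = spark.createDataFrame(upperRdd, df.schema)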
user2230605 asked Dec 02 '15


People also ask

How do I apply a function to each row of Spark DataFrame?

Using apply(), you can call a function on every row of a pandas DataFrame; for example, an add() function can be applied to every row. To iterate row by row with apply(), use axis=1.

How do you apply a function in Spark?

In PySpark you apply a function to a column through a user-defined function (UDF): the import brings in the UDF wrapper, which is used to pass your own function; the wrapped function then takes the column name as its parameter and is applied to that column of the dataframe.

How do I apply a function to each column in pandas?

pandas provides a DataFrame member function, apply(), that applies a function along an axis of the DataFrame, i.e. along each row or column. The important argument is func, the function to be applied to each column or row; it accepts a Series and returns a Series.

How do I get specific rows in Spark DataFrame?

One method is to use select() together with collect(): select() picks the columns to be displayed in each row, and collect() returns the selected rows of the PySpark dataframe so you can pull out a particular one.
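
A rough Scala equivalent of that select()/collect() pattern (a sketch, assuming a dataframe df with columns x and y):

import org.apache.spark.sql.Row

// select() picks the columns shown in each row; collect() brings the rows back to the driver
val rows: Array[Row] = df.select("x", "y").collect()
val firstRow: Row = rows(0)  // pick out a specific row, e.g. the first one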


1 Answer

I needed to do something similar, but I had to write my own function to convert empty strings within a dataframe to null. This is what I did.

import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

// convert null or blank strings to None so Spark stores them as null
def emptyToNull(_str: String): Option[String] = {
  _str match {
    case _ if (_str == null || _str.trim.isEmpty) => None
    case _ => Some(_str)
  }
}
val emptyToNullUdf = udf(emptyToNull(_: String))

val df = Seq(("a", "B", "c"), ("D", "e ", ""), ("", "", null)).toDF("x", "y", "z")
// apply the UDF to every column, keeping the original column names
df.select(df.columns.map(c => emptyToNullUdf(col(c)).alias(c)): _*).show

+----+----+----+
|   x|   y|   z|
+----+----+----+
|   a|   B|   c|
|   D|  e |null|
|null|null|null|
+----+----+----+

Here's a more refined version of emptyToNull, using Option instead of a null check:

def emptyToNull(_str: String): Option[String] = Option(_str) match {
  case ret @ Some(s) if (s.trim.nonEmpty) => ret
  case _ => None
}
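
The same select-over-all-columns pattern also covers the uppercasing asked about in the question; here is a sketch using Spark's built-in upper function (assuming every column is a string, so no UDF is needed):

import org.apache.spark.sql.functions.{col, upper}

// uppercase every column while keeping the original column names
val upperDf = df.select(df.columns.map(c => upper(col(c)).alias(c)): _*)
upperDf.show()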
Tony Fraser answered Dec 06 '22