
Replace missing values with mean - Spark Dataframe

Tags: dataframe, scala, imputation, apache-spark, apache-spark-sql

I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new to Spark, so I have been struggling to implement this logic. This is what I have managed to do so far:

a) To do this for a single column (let's say Col A), this line of code seems to work:

df.withColumn("new_Col", when($"ColA".isNull, df.select(mean("ColA"))
  .first()(0).asInstanceOf[Double])
  .otherwise($"ColA"))

b) However, I have not been able to figure out how to do this for all the columns in my dataframe. I tried the map function, but I believe it loops through each row of a dataframe rather than each column.

c) There is a similar question on SO - here. And while I liked the solution (using Aggregated tables and coalesce), I was very keen to know if there is a way to do this by looping through each column (I come from R, so looping through each column using a higher order functional like lapply seems more natural to me).

Thanks!

asked Oct 15 '16 09:10

Dataminer

3 Answers

Spark >= 2.2

You can use org.apache.spark.ml.feature.Imputer (which supports both the mean and the median strategy).

Scala:

import org.apache.spark.ml.feature.Imputer

val imputer = new Imputer()
  .setInputCols(df.columns)
  .setOutputCols(df.columns.map(c => s"${c}_imputed"))
  .setStrategy("mean")

imputer.fit(df).transform(df)

Python:

from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=df.columns, 
    outputCols=["{}_imputed".format(c) for c in df.columns]
)
imputer.fit(df).transform(df)

Spark < 2.2

Here you are:

import org.apache.spark.sql.functions.mean

df.na.fill(df.columns.zip(
  df.select(df.columns.map(mean(_)): _*).first.toSeq
).toMap)

where

df.columns.map(mean(_)): Array[Column]

computes an average for each column,

df.select(_: _*).first.toSeq: Seq[Any]

collects the aggregated values and converts the row to Seq[Any] (I know it is suboptimal but this is the API we have to work with),

df.columns.zip(_).toMap: Map[String,Any]

creates aMap: Map[String, Any] which maps from the column name to its average, and finally:

df.na.fill(_): DataFrame

fills the missing values using:

fill: Map[String, Any] => DataFrame

from DataFrameNaFunctions.
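
For readers coming from Python, the zip-and-map step described above can be mimicked without Spark. This is only a sketch with stand-in data: columns and means_row are hypothetical placeholders for df.columns and the aggregated first row, not values from a real dataframe.

```python
# Hypothetical stand-ins: in Spark these would come from df.columns and
# from the first row of df.select(<one mean per column>).
columns = ["a", "b", "c"]
means_row = (1.5, 20.0, 0.25)

# Pair each column name with its mean, then build a name -> mean mapping,
# mirroring df.columns.zip(...).toMap in the Scala snippet.
fill_values = dict(zip(columns, means_row))

print(fill_values)
```

The resulting dictionary has the same shape as the Map[String, Any] that is handed to df.na.fill.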

To ignore NaN entries you can replace:

df.select(df.columns.map(mean(_)): _*).first.toSeq

with:

import org.apache.spark.sql.functions.{col, isnan, when}


df.select(df.columns.map(
  c => mean(when(!isnan(col(c)), col(c)))
): _*).first.toSeq

answered Sep 23 '22 20:09

zero323

For PySpark, this is the code I used:

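# Map every column name to the aggregate to apply (mean)
mean_dict = {c: 'mean' for c in df.columns}
# One-row result whose keys look like 'avg(col1)'
col_avgs = df.agg(mean_dict).collect()[0].asDict()
# Strip the 'avg(' prefix and ')' suffix; items() replaces the Python 2-only iteritems()
col_avgs = {k[4:-1]: v for k, v in col_avgs.items()}
df.fillna(col_avgs).show()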

The four steps are:

1. Create the dictionary mean_dict, mapping each column name to the aggregate operation (mean).
2. Calculate the mean of each column and save the result as the dictionary col_avgs.
3. The keys in col_avgs start with avg( and end with ), e.g. avg(col1). Strip that wrapper so the keys are plain column names again.
4. Fill the missing values in the dataframe with the averages from col_avgs.
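
The key-renaming step can be tried in isolation with plain Python. The sketch below is an illustration under assumptions, not part of the original answer: clean_avg_keys is a hypothetical helper, example stands in for the dictionary returned by asDict(), and skipping None means (which I would expect for a column with no non-null values) is my own choice rather than something the answer does.

```python
def clean_avg_keys(agg_result):
    """Map 'avg(name)' keys back to 'name' and skip entries without a mean."""
    cleaned = {}
    for key, value in agg_result.items():
        # Only strip when the key really has the avg(...) wrapper.
        if key.startswith("avg(") and key.endswith(")"):
            key = key[len("avg("):-1]
        # Assumption: an entry with no mean is better skipped than
        # passed on as a fill value.
        if value is not None:
            cleaned[key] = value
    return cleaned


# Hypothetical stand-in for df.agg(mean_dict).collect()[0].asDict()
example = {"avg(col1)": 2.0, "avg(col2)": None, "avg(price)": 9.5}
print(clean_avg_keys(example))
```

Checking for the wrapper explicitly, rather than slicing with fixed offsets, keeps the helper from mangling keys that do not follow the avg(...) pattern.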

answered Sep 24 '22 20:09

Michael P

For imputing the median (instead of the mean) in PySpark < 2.2:

# Keep only the numeric columns
num_cols = [name for name, dtype in df.dtypes if dtype in {"bigint", "double", "int"}]

# Build a dict of {column name: approximate median}
median_dict = {c: df.stat.approxQuantile(c, [0.5], 0.001)[0] for c in num_cols}

Then apply na.fill:

df_imputed = df.na.fill(median_dict)
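
The numeric-column filter in this answer works on df.dtypes, which is a list of (column name, type string) pairs, so it can be checked without a Spark session. A minimal sketch, assuming made-up example_dtypes and a hypothetical helper numeric_columns:

```python
# Same type names as in the answer; other numeric type strings such as
# "float" or "smallint" would need to be added if the data uses them.
NUMERIC_TYPES = {"bigint", "double", "int"}


def numeric_columns(dtypes, numeric_types=NUMERIC_TYPES):
    """Return the names of columns whose type string is in numeric_types."""
    return [name for name, dtype in dtypes if dtype in numeric_types]


# Hypothetical stand-in for df.dtypes
example_dtypes = [
    ("id", "bigint"),
    ("name", "string"),
    ("score", "double"),
    ("flag", "boolean"),
]
print(numeric_columns(example_dtypes))
```

Keeping the set of type names as a parameter makes it easy to widen the filter without touching the median computation.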