I have a Spark DataFrame loaded up in memory, and I want to take the mean (or any aggregate operation) over the columns. How would I do that? (In numpy, this is known as taking an operation over axis=1.)
If one were calculating the mean of the DataFrame down the rows (axis=0), then this is already built in:
from pyspark.sql import functions as F
F.mean(...)
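For example, something like this gives the column-wise means (a quick sketch, assuming the example DataFrame shown below is bound to df):
# Column-wise (axis=0) means: one aggregate per column.
# Assumes `df` is the example DataFrame shown below.
df.agg(F.mean("US"), F.mean("UK"), F.mean("Can")).show()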
But is there a way to programmatically do this against the entries in the columns? For example, from the DataFrame below
+--+--+---+---+
|id|US| UK|Can|
+--+--+---+---+
| 1|50|  0|  0|
| 1| 0|100|  0|
| 1| 0|  0|125|
| 2|75|  0|  0|
+--+--+---+---+
Omitting id, the means would be
+------+
|  mean|
+------+
| 16.66|
| 33.33|
| 41.67|
| 25.00|
+------+
All you need here is standard SQL like this:
SELECT (US + UK + Can) / 3 AS mean FROM df
which can be used directly with SQLContext.sql (or spark.sql on a SparkSession), or expressed using the DataFrame DSL:
df.select(((col("UK") + col("US") + col("Can")) / lit(3)).alias("mean"))
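For completeness, a minimal sketch of the SQL route, assuming a SparkSession named spark and the example DataFrame bound to df:
# Register the DataFrame as a temporary view and run the query above.
df.createOrReplaceTempView("df")
spark.sql("SELECT (US + UK + Can) / 3 AS mean FROM df").show()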
If you have a larger number of columns, you can generate the expression as follows:
from functools import reduce
from operator import add
from pyspark.sql.functions import col, lit

# number of value columns (everything except id)
n = lit(len(df.columns) - 1.0)
# sum the value columns and divide by their count
rowMean = (reduce(add, (col(x) for x in df.columns[1:])) / n).alias("mean")
df.select(rowMean)
or, using Python's built-in sum:
rowMean = (sum(col(x) for x in df.columns[1:]) / n).alias("mean")
df.select(rowMean)
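Either variant can be evaluated directly and, if you want two-decimal output like the table above, rounded (a small sketch using pyspark.sql.functions.round):
from pyspark.sql.functions import round as spark_round

# Round the row-wise mean to two decimals for display.
df.select(spark_round(rowMean, 2).alias("mean")).show()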
Finally, its equivalent in Scala:
import org.apache.spark.sql.functions.col

df.select(df.columns
  .drop(1)
  .map(col)
  .reduce(_ + _)
  .divide(df.columns.size - 1)
  .alias("mean"))
In a more complex scenario you can combine columns using the array function and use a UDF to compute statistics:
import numpy as np
from pyspark.sql.functions import array, col, udf
from pyspark.sql.types import FloatType
combined = array(*(col(x) for x in df.columns[1:]))
median_udf = udf(lambda xs: float(np.median(xs)), FloatType())
df.select(median_udf(combined).alias("median"))
The same operation expressed using the Scala API:
import org.apache.spark.sql.functions.{array, col, udf}
import org.apache.spark.sql.types.DoubleType

val combined = array(df.columns.drop(1).map(col).map(_.cast(DoubleType)): _*)
val median_udf = udf((xs: Seq[Double]) =>
  breeze.stats.DescriptiveStats.percentile(xs, 0.5))
df.select(median_udf(combined).alias("median"))
Since Spark 2.4, an alternative approach is to combine the values into an array and apply the aggregate expression. See for example Spark Scala row-wise average by handling null.
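A minimal PySpark sketch of that approach, assuming Spark 2.4+ and the example schema, with the expression built from the non-id columns:
from pyspark.sql.functions import expr

# Builds: aggregate(array(US, UK, Can), 0D, (acc, x) -> acc + x) / size(array(...))
value_cols = df.columns[1:]
arr = "array({})".format(", ".join(value_cols))
row_mean = expr(
    "aggregate({0}, 0D, (acc, x) -> acc + x) / size({0})".format(arr)
).alias("mean")

df.select(row_mean).show()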
In Scala, something like this would do it (with spark.implicits._ in scope):
val cols = Seq("US", "UK", "Can")
df.map(r => (r.getAs[Int]("id"), r.getValuesMap[Int](cols).values.sum.toDouble / cols.length))
  .toDF("id", "mean")