
How to find median and quantiles using Spark

How can I find median of an RDD of integers using a distributed method, IPython, and Spark? The RDD is approximately 700,000 elements and therefore too large to collect and find the median.

This question is similar to the question below; however, the answer to that question uses Scala, which I do not know.

How can I calculate exact median with Apache Spark?

Using the thinking for the Scala answer, I am trying to write a similar answer in Python.

I know I first want to sort the RDD, but I do not know how. I see the sortBy (sorts this RDD by the given keyfunc) and sortByKey (sorts this RDD, which is assumed to consist of (key, value) pairs) methods. I think both use key-value pairs, and my RDD only has integer elements.

  1. First, I was thinking of doing myrdd.sortBy(lambda x: x)?
  2. Next I will find the length of the rdd (rdd.count()).
  3. Finally, I want to find the element or the 2 elements at the center of the RDD. I need help with this step too (I sketch what I mean below this list).
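Putting the three steps together, this is roughly what I have in mind, using myrdd from above (I am not sure it is correct or efficient):

# Step 1: sort, then pair every element with its position so I can look it up later.
sorted_rdd = myrdd.sortBy(lambda x: x).zipWithIndex().map(lambda vi: (vi[1], vi[0])).cache()

# Step 2: length of the RDD.
n = sorted_rdd.count()

# Step 3: middle element for odd n, average of the two middle elements for even n.
if n % 2 == 1:
    median = sorted_rdd.lookup(n // 2)[0]
else:
    median = (sorted_rdd.lookup(n // 2 - 1)[0] + sorted_rdd.lookup(n // 2)[0]) / 2.0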

EDIT:

I had an idea. Maybe I can index my RDD and then key = index and value = element. And then I can try to sort by value? I don't know if this is possible because there is only a sortByKey method.
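For example, is something like one of these what the API expects (I am just guessing at keyBy here)?

myrdd.sortBy(lambda x: x)                        # sortBy seems to take plain elements directly
myrdd.keyBy(lambda x: x).sortByKey().values()    # or go through (key, value) pairs and back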

asked Jul 15 '15 by pr338


People also ask

How do you find the median in PySpark in SQL?

We can define our own UDF in PySpark and use the Python library NumPy, which has a method that calculates the median of an array. The DataFrame is first grouped by a column value; after grouping, the column whose median needs to be calculated is collected as a list, and the UDF computes the median of that list.
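A minimal sketch of that approach might look like this (the DataFrame df and the column names "group" and "value" are placeholders, not from the question):

from pyspark.sql import functions as F, types as T
import numpy as np

# A UDF that reduces a collected list of values to its median.
median_udf = F.udf(lambda xs: float(np.median(xs)), T.DoubleType())

medians = (df
    .groupBy("group")
    .agg(F.collect_list("value").alias("values"))
    .withColumn("median", median_udf("values")))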

How do you find the median in SQL?

To get the median we have to use PERCENTILE_CONT(0.5). If you want to define a specific set of rows grouped to get the median, then use the OVER (PARTITION BY) clause. Here I've used PARTITION BY on the column OrderID so as to find the median of unit prices for the order ids.

How are PySpark percentiles calculated?

In order to calculate the percentile rank of a column in PySpark, we use the percent_rank() function. percent_rank(), together with a window partitioned by another column via partitionBy(), calculates the percentile rank of the column by group.
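For example (df and the column names "group" and "value" are placeholders):

from pyspark.sql import Window, functions as F

# percent_rank() over a window partitioned by "group" and ordered by "value"
w = Window.partitionBy("group").orderBy("value")
ranked = df.withColumn("percentile_rank", F.percent_rank().over(w))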


1 Answer

Ongoing work

SPARK-30569 - Add DSL functions invoking percentile_approx

Spark 2.0+:

You can use the approxQuantile method, which implements the Greenwald-Khanna algorithm:

Python:

df.approxQuantile("x", [0.5], 0.25) 

Scala:

df.stat.approxQuantile("x", Array(0.5), 0.25) 

where the last parameter is the relative error. The lower the number, the more accurate the result and the more expensive the computation.
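For example, turning the RDD of integers from the question into a single-column DataFrame first (the column name "x" and the relative error of 0.01 are arbitrary choices here; 0.0 would give the exact median at a higher cost):

df = rdd.map(lambda x: (float(x), )).toDF(["x"])
median = df.approxQuantile("x", [0.5], 0.01)[0]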

Since Spark 2.2 (SPARK-14352) it supports estimation on multiple columns:

df.approxQuantile(["x", "y", "z"], [0.5], 0.25) 

and

df.approxQuantile(Array("x", "y", "z"), Array(0.5), 0.25) 

The underlying method can also be used in SQL aggregation (both global and grouped) via the approx_percentile function:

> SELECT approx_percentile(10.0, array(0.5, 0.4, 0.1), 100);
 [10.0,10.0,10.0]
> SELECT approx_percentile(10.0, 0.5, 100);
 10.0
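A grouped aggregation works the same way; for example (the table and column names here are just placeholders):

spark.sql("""
    SELECT group_id, approx_percentile(x, 0.5, 100) AS median_x
    FROM some_table
    GROUP BY group_id
""")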

Spark < 2.0

Python

As I've mentioned in the comments, it is most likely not worth all the fuss. If the data is relatively small, like in your case, then simply collect and compute the median locally:

import numpy as np

np.random.seed(323)
rdd = sc.parallelize(np.random.randint(1000000, size=700000))

%time np.median(rdd.collect())
np.array(rdd.collect()).nbytes

It takes around 0.01 second on my few years old computer and around 5.5MB of memory.

If the data is much larger, sorting will be a limiting factor, so instead of getting an exact value it is probably better to sample, collect, and compute locally. But if you really want to use Spark, something like this should do the trick (if I didn't mess up anything):

import time

import numpy as np
from numpy import floor


def quantile(rdd, p, sample=None, seed=None):
    """Compute a quantile of order p ∈ [0, 1]
    :rdd a numeric rdd
    :p quantile (between 0 and 1)
    :sample fraction of the rdd to use. If not provided we use the whole dataset
    :seed random number generator seed to be used with sample
    """
    assert 0 <= p <= 1
    assert sample is None or 0 < sample <= 1

    seed = seed if seed is not None else time.time()
    rdd = rdd if sample is None else rdd.sample(False, sample, seed)

    rddSortedWithIndex = (rdd
        .sortBy(lambda x: x)
        .zipWithIndex()
        .map(lambda xi: (xi[1], xi[0]))  # (index, value) so we can look up by position
        .cache())

    n = rddSortedWithIndex.count()
    h = (n - 1) * p

    rddX, rddXPlusOne = (
        rddSortedWithIndex.lookup(x)[0]
        for x in int(floor(h)) + np.array([0, 1]))

    return rddX + (h - floor(h)) * (rddXPlusOne - rddX)

And some tests:

np.median(rdd.collect()), quantile(rdd, 0.5)
## (500184.5, 500184.5)
np.percentile(rdd.collect(), 25), quantile(rdd, 0.25)
## (250506.75, 250506.75)
np.percentile(rdd.collect(), 75), quantile(rdd, 0.75)
## (750069.25, 750069.25)

Finally, let's define median:

from functools import partial

median = partial(quantile, p=0.5)
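median(rdd) is now equivalent to quantile(rdd, 0.5):

median(rdd)
## 500184.5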

So far so good, but it takes 4.66 s in local mode without any network communication. There is probably a way to improve this, but why even bother?

Language independent (Hive UDAF):

If you use HiveContext you can also use Hive UDAFs. With integral values:

rdd.map(lambda x: (float(x), )).toDF(["x"]).registerTempTable("df")

sqlContext.sql("SELECT percentile_approx(x, 0.5) FROM df")

With continuous values:

sqlContext.sql("SELECT percentile(x, 0.5) FROM df") 

In percentile_approx you can pass an additional argument which determines the number of records to use.
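For example (the value 10000 here is just illustrative):

sqlContext.sql("SELECT percentile_approx(x, 0.5, 10000) FROM df")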

answered Oct 13 '22 by zero323