
Efficient boolean reductions `any`, `all` for PySpark RDD?

Tags:

apache-spark

PySpark supports common reductions like `sum`, `min`, and `count`. Does it also support boolean reductions like `all` and `any`?

I can always `fold` over `or_` and `and_`, but this seems inefficient.
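For reference, the fold I have in mind looks roughly like this (a plain-Python sketch of the same reduction; on an actual RDD this would be `rdd.fold(True, and_)` / `rdd.fold(False, or_)`):

```python
from functools import reduce
from operator import and_, or_

data = [True, True, False, True]

# "all" as a fold over and_, "any" as a fold over or_
all_result = reduce(and_, data, True)
any_result = reduce(or_, data, False)

print(all_result)  # False
print(any_result)  # True
```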

asked Jun 08 '14 by MRocklin



2 Answers

This is very late, but `all` over a set of boolean values `z` is the same as `min(z) == True`, and `any` is the same as `max(z) == True`, since `False` sorts below `True`.
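A quick sketch of why this works (shown with plain Python builtins here; the same equivalence holds on an RDD with `rdd.min()` and `rdd.max()`, which Spark computes as parallel reductions):

```python
# False < True in Python, so min over booleans is True only when every
# value is True, and max is True when at least one value is True.
z = [True, True, False]

print((min(z) == True) == all(z))  # True: both sides are False
print((max(z) == True) == any(z))  # True: both sides are True

z2 = [True, True]
print((min(z2) == True) == all(z2))  # True: both sides are True
```

One caveat worth knowing: on an empty collection, `min`/`max` raise an error rather than returning the identity values (`all([])` is `True`, `any([])` is `False`), so guard that case separately.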

answered Nov 15 '22 by Zach Lamberty


No, the underlying Scala API doesn't have it, so the Python API definitely won't. I don't think it will be added either, as it's very easy to define in terms of `filter`.

Yes, using `fold` would be inefficient, because it cannot short-circuit: it must process every partition even after the result is already decided. Do something like `.filter(x => !condition(x)).take(1).isEmpty` to mean `.forall(condition)`, and `.filter(condition).take(1).nonEmpty` to mean `.exists(condition)`; `take(1)` stops as soon as one element is found.
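A plain-Python sketch of the same idea (the helper names `forall`/`exists` are illustrative, not PySpark API; in PySpark the equivalent would be `rdd.filter(...).take(1)`, where `take(1)` avoids scanning the whole dataset):

```python
from itertools import islice

def forall(condition, xs):
    # Keep only counterexamples and try to take one;
    # finding none means every element satisfies the condition.
    counterexamples = (x for x in xs if not condition(x))
    return len(list(islice(counterexamples, 1))) == 0

def exists(condition, xs):
    # Try to take a single witness; finding one is enough,
    # so the rest of the data is never examined.
    witnesses = (x for x in xs if condition(x))
    return len(list(islice(witnesses, 1))) > 0

print(forall(lambda x: x > 0, [1, 2, 3]))   # True
print(exists(lambda x: x < 0, [1, 2, 3]))   # False
```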

(General suggestion: the underlying Scala API is generally more flexible than the Python API, so I suggest you move to it. It also makes debugging much easier, as you have fewer layers to dig through. Scala means Scalable Language: it's much better suited to scalable applications and more robust than dynamically typed languages.)

answered Nov 15 '22 by samthebest