
Efficient boolean reductions `any`, `all` for PySpark RDD?

Tags:

apache-spark

PySpark supports common reductions like `sum`, `min`, and `count`. Does it also support boolean reductions like `all` and `any`?

I can always `fold` over `or_` and `and_`, but this seems inefficient.
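For reference, the fold I have in mind looks roughly like this (a plain-Python sketch of the same reduction; on an actual RDD this would be `rdd.fold(True, and_)` / `rdd.fold(False, or_)`):

```python
from functools import reduce
from operator import and_, or_

data = [True, True, False, True]

# "all" as a fold over and_, "any" as a fold over or_
all_result = reduce(and_, data, True)
any_result = reduce(or_, data, False)

print(all_result)  # False
print(any_result)  # True
```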

asked Jun 08 '14 by MRocklin



2 Answers

This is very late, but `all` over a set of boolean values `z` is the same as `min(z) == True`, and `any` is the same as `max(z) == True`, since `False` sorts below `True`.
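A quick sketch of why this works (shown with plain Python builtins here; the same equivalence holds on an RDD with `rdd.min()` and `rdd.max()`, which Spark computes as parallel reductions):

```python
# False < True in Python, so min over booleans is True only when every
# value is True, and max is True when at least one value is True.
z = [True, True, False]

print((min(z) == True) == all(z))  # True: both sides are False
print((max(z) == True) == any(z))  # True: both sides are True

z2 = [True, True]
print((min(z2) == True) == all(z2))  # True: both sides are True
```

One caveat worth knowing: on an empty collection, `min`/`max` raise an error rather than returning the identity values (`all([])` is `True`, `any([])` is `False`), so guard that case separately.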

answered Nov 15 '22 by Zach Lamberty


No, the underlying Scala API doesn't have it, so the Python API definitely won't. I don't think it will be added either, as it's very easy to define in terms of `filter`.

Yes, using `fold` would be inefficient, because it cannot short-circuit: it must process every partition even after the result is already decided. Do something like `.filter(x => !condition(x)).take(1).isEmpty` to mean `.forall(condition)`, and `.filter(condition).take(1).nonEmpty` to mean `.exists(condition)`; `take(1)` stops as soon as one element is found.
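A plain-Python sketch of the same idea (the helper names `forall`/`exists` are illustrative, not PySpark API; in PySpark the equivalent would be `rdd.filter(...).take(1)`, where `take(1)` avoids scanning the whole dataset):

```python
from itertools import islice

def forall(condition, xs):
    # Keep only counterexamples and try to take one;
    # finding none means every element satisfies the condition.
    counterexamples = (x for x in xs if not condition(x))
    return len(list(islice(counterexamples, 1))) == 0

def exists(condition, xs):
    # Try to take a single witness; finding one is enough,
    # so the rest of the data is never examined.
    witnesses = (x for x in xs if condition(x))
    return len(list(islice(witnesses, 1))) > 0

print(forall(lambda x: x > 0, [1, 2, 3]))   # True
print(exists(lambda x: x < 0, [1, 2, 3]))   # False
```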

(General suggestion: the underlying Scala API is generally more flexible than the Python API, so I suggest you move to it. It also makes debugging much easier, as you have fewer layers to dig through. Scala means Scalable Language: it's much better suited to scalable applications and more robust than dynamically typed languages.)

answered Nov 15 '22 by samthebest