I'm looking for the PySpark equivalent to this question: How to get the number of elements in partition?
Specifically, I want to programmatically count the number of elements in each partition of a PySpark RDD or DataFrame (I know this information is available in the Spark Web UI).
This attempt:
df.foreachPartition(lambda iter: sum(1 for _ in iter))
results in:
AttributeError: 'NoneType' object has no attribute '_jvm'
I do not want to collect the contents of the iterator into memory.
A related but different question is how to find the number of partitions (as opposed to the number of elements in each one). In PySpark you get it by calling getNumPartitions() on an RDD, so for a DataFrame you first convert it with df.rdd; the Scala equivalent is rdd.partitions.size. After a shuffle you would typically see 200 partitions, the default value of spark.sql.shuffle.partitions. Spark assigns one task per partition, and each executor core processes one task at a time; a minimal sketch follows.
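A minimal sketch of the partition-count call (assuming an active SparkSession named spark; the range DataFrame here is purely illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# An illustrative DataFrame, explicitly repartitioned
df = spark.range(100).repartition(8)

# Number of partitions (not the number of elements in each one)
print(df.rdd.getNumPartitions())  # 8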
If you are asking: can we get the number of elements in an iterator without iterating through it? The answer is No.
But we don't have to store the partition's contents in memory while counting, as in the post you mentioned:
def count_in_a_partition(idx, iterator):
    # Consume the iterator one element at a time; nothing is buffered
    count = 0
    for _ in iterator:
        count += 1
    # mapPartitionsWithIndex expects the function to return an
    # iterable, so yield the (partition index, count) pair
    yield idx, count

data = sc.parallelize([1, 2, 3, 4], 4)

data.mapPartitionsWithIndex(count_in_a_partition).collect()
# [(0, 1), (1, 1), (2, 1), (3, 1)]
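The same function works for a DataFrame once it is converted to an RDD (assuming df is the DataFrame from your question):

df.rdd.mapPartitionsWithIndex(count_in_a_partition).collect()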
EDIT
Note that your code is very close to the solution: use mapPartitions instead of foreachPartition, and make the function return an iterator:
def count_in_a_partition(iterator):
    # A generator keeps this lazy: yield one count per partition
    yield sum(1 for _ in iterator)

data.mapPartitions(count_in_a_partition).collect()
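mapPartitions returns a new RDD holding one count per partition, whereas foreachPartition is an action intended for side effects and returns None, so it cannot hand the counts back. For the four-element, four-partition RDD above, collect() returns [1, 1, 1, 1].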