
How to get the size of an RDD in Pyspark?

I am relatively new to Apache Spark and Python and was wondering how to get the size of an RDD. I have an RDD that looks like this:

[['ID: 6993.1066',
  'Time: 15:53:43',
  'Lab: West',
  'Lab-Tech: Nancy McNabb, ',
  '\tBob Jones, Harry Lim, ',
  '\tSue Smith, Will Smith, ',
  '\tTerry Smith, Nandini Chandra, ',
  ]]

Is there a method or function in PySpark that can give the size, i.e. how many tuples are in an RDD? The one above has 7.

Scala has something like: myRDD.length.

asked Feb 21 '18 by Steve McAffer

People also ask

How do I check the size of a file in PySpark?

Similar to Python pandas, you can get the size and shape of a PySpark (Spark with Python) DataFrame by running the count() action to get the number of rows and len(df.columns) to get the number of columns.
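
A minimal sketch of that, assuming a local SparkSession and a made-up toy DataFrame (the column names here are only for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("df-shape").getOrCreate()

# hypothetical toy DataFrame, just to demonstrate the shape calls
df = spark.createDataFrame(
    [("6993.1066", "15:53:43", "West")],
    ["id", "time", "lab"],
)

num_rows = df.count()        # action: runs a Spark job and returns the row count
num_cols = len(df.columns)   # df.columns is a plain Python list of column names

print((num_rows, num_cols))  # (1, 3)

spark.stop()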

How do you calculate RDD?

The answer is that rdd.count() is an "action": it is an eager operation, because it has to return an actual number. The RDD operations you've performed before count() were "transformations": they transformed one RDD into another lazily. In effect, the transformations were not actually performed, just queued up.
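
A small sketch of that laziness, assuming sc is an existing SparkContext (e.g. from a pyspark shell):

rdd = sc.parallelize(range(10))

doubled = rdd.map(lambda x: x * 2)   # transformation: nothing runs yet; Spark only records the lineage

print(doubled.count())               # action: the map is actually executed now, and 10 is returned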


1 Answer

To get the size of each individual element of an RDD, this appears to be the way:

>>> rdd = sc.parallelize([(1,2,'the'),(5,2,5),(1,1,'apple')])
>>> rdd.map(lambda x: len(x)).collect()
[3, 3, 3]

For the overall element count within the RDD:

>>> rdd.count()
3
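
Applied to an RDD shaped like the one in the question (a single element that is itself a list of seven strings), the same two calls would, under that assumption, give:

>>> nested = sc.parallelize([['ID: 6993.1066', 'Time: 15:53:43', 'Lab: West',
...                           'Lab-Tech: Nancy McNabb, ', '\tBob Jones, Harry Lim, ',
...                           '\tSue Smith, Will Smith, ', '\tTerry Smith, Nandini Chandra, ']])
>>> nested.count()               # one top-level element
1
>>> nested.map(len).collect()    # length of that inner list
[7]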
answered Oct 20 '22 by Bala