
How to get the size of an RDD in Pyspark?

I am relatively new to Apache Spark and Python and was wondering how to get the size of an RDD. I have an RDD that looks like this:

[['ID: 6993.1066',
  'Time: 15:53:43',
  'Lab: West',
  'Lab-Tech: Nancy McNabb, ',
  '\tBob Jones, Harry Lim, ',
  '\tSue Smith, Will Smith, ',
  '\tTerry Smith, Nandini Chandra, ',
  ]]

Is there a method or function in PySpark that can give the size, i.e. how many tuples are in an RDD? The one above has 7.

Scala has something like: myRDD.length.

asked Feb 21 '18 by Steve McAffer

People also ask

How do I check the size of a file in PySpark?

Similar to Python pandas, you can get the size and shape of a PySpark (Spark with Python) DataFrame by running the count() action to get the number of rows and len(df.columns) to get the number of columns.
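
A minimal sketch of that, assuming a local SparkSession and a made-up toy DataFrame (the column names here are only for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("df-shape").getOrCreate()

# hypothetical toy DataFrame, just to demonstrate the shape calls
df = spark.createDataFrame(
    [("6993.1066", "15:53:43", "West")],
    ["id", "time", "lab"],
)

num_rows = df.count()        # action: runs a Spark job and returns the row count
num_cols = len(df.columns)   # df.columns is a plain Python list of column names

print((num_rows, num_cols))  # (1, 3)

spark.stop()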

How do you calculate RDD?

The answer is that rdd.count() is an "action": it is an eager operation, because it has to return an actual number. The RDD operations you've performed before count() were "transformations": they transformed one RDD into another lazily. In effect, the transformations were not actually performed, just queued up.
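
A small sketch of that laziness, assuming sc is an existing SparkContext (e.g. from a pyspark shell):

rdd = sc.parallelize(range(10))

doubled = rdd.map(lambda x: x * 2)   # transformation: nothing runs yet; Spark only records the lineage

print(doubled.count())               # action: the map is actually executed now, and 10 is returned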


1 Answer

To get the size of each individual element of an RDD, this appears to be the way:

>>> rdd = sc.parallelize([(1,2,'the'),(5,2,5),(1,1,'apple')])
>>> rdd.map(lambda x: len(x)).collect()
[3, 3, 3]

For the overall element count within the RDD:

>>> rdd.count()
3
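
Applied to an RDD shaped like the one in the question (a single element that is itself a list of seven strings), the same two calls would, under that assumption, give:

>>> nested = sc.parallelize([['ID: 6993.1066', 'Time: 15:53:43', 'Lab: West',
...                           'Lab-Tech: Nancy McNabb, ', '\tBob Jones, Harry Lim, ',
...                           '\tSue Smith, Will Smith, ', '\tTerry Smith, Nandini Chandra, ']])
>>> nested.count()               # one top-level element
1
>>> nested.map(len).collect()    # length of that inner list
[7]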
answered Oct 20 '22 by Bala