I am relatively new to Apache Spark and Python and was wondering how to get the size of an RDD. I have an RDD that looks like this:
[['ID: 6993.1066',
'Time: 15:53:43',
'Lab: West',
'Lab-Tech: Nancy McNabb, ',
'\tBob Jones, Harry Lim, ',
'\tSue Smith, Will Smith, ',
'\tTerry Smith, Nandini Chandra, ',
]]
Is there a method or function in PySpark that can give the size, i.e. how many tuples are in an RDD? The one above has 7.
Scala has something like: myRDD.length.
Similar to pandas in Python, you can get the size and shape of a PySpark (Spark with Python) DataFrame by running the count() action to get the number of rows and len(df.columns) to get the number of columns (df.columns is a plain list attribute, so it is not called with parentheses).
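For example, here is a minimal sketch of both calls, assuming an existing SparkSession named spark and a small made-up DataFrame:
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], ['id', 'letter'])
>>> df.count()        # number of rows (an action, so it runs a job)
3
>>> len(df.columns)   # number of columns (df.columns is a Python list)
2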
The answer is that rdd.count() is an "action": it is an eager operation, because it has to return an actual number. The RDD operations you performed before count() were "transformations": they transform one RDD into another lazily. In effect, the transformations are not actually performed, just queued up.
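For instance, a small sketch of the difference, assuming sc is an existing SparkContext; nothing is computed until the action at the end:
>>> rdd = sc.parallelize(range(10))
>>> doubled = rdd.map(lambda x: x * 2)   # transformation: only queued, not executed
>>> doubled.count()                      # action: triggers the actual computation
10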
For the size of each individual element in an RDD, this appears to be the way:
>>> rdd = sc.parallelize([(1,2,'the'),(5,2,5),(1,1,'apple')])
>>> rdd.map(lambda x: len(x)).collect()
[3, 3, 3]
For the overall element count within the RDD:
>>> rdd.count()
3
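Applied to an RDD shaped like the one in the question (a single element that is itself a list of 7 strings), count() returns 1, so you would count the inner strings instead. A rough sketch, again assuming sc is an existing SparkContext:
>>> data = [['ID: 6993.1066', 'Time: 15:53:43', 'Lab: West',
...          'Lab-Tech: Nancy McNabb, ', '\tBob Jones, Harry Lim, ',
...          '\tSue Smith, Will Smith, ', '\tTerry Smith, Nandini Chandra, ']]
>>> rdd = sc.parallelize(data)
>>> rdd.count()                        # one element: the inner list
1
>>> rdd.map(len).collect()             # number of strings in each inner list
[7]
>>> rdd.flatMap(lambda x: x).count()   # total strings across all inner lists
7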