Show partitions on a pyspark RDD

The pyspark RDD documentation

http://spark.apache.org/docs/1.2.1/api/python/pyspark.html#pyspark.RDD

does not show any method(s) to display partition information for an RDD.

Is there any way to get that information without executing an additional step, e.g.:

myrdd.mapPartitions(lambda x: iter([1])).sum()  # emits one 1 per partition, then sums them

The above does work, but it seems like extra effort.

asked Mar 15 '15 by WestCoastProjects


People also ask

How do I check my PySpark DataFrame partitions?

In PySpark (Spark with Python), you can get the current number of partitions by calling getNumPartitions() on an RDD, so to use it with a DataFrame you first need to convert the DataFrame to its underlying RDD.
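A minimal sketch of that conversion (assuming Spark 2.x+ with a SparkSession; the DataFrame here is just sample data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100)  # example DataFrame
# .rdd exposes the underlying RDD, which carries the partition count
print(df.rdd.getNumPartitions())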

How many partitions does an RDD have?

As already mentioned above, one partition is created for each block of the file in HDFS, where each block is 64 MB by default. However, when creating an RDD, a second argument can be passed that sets the number of partitions to create. The line of code sketched below creates an RDD named textFile with 5 partitions.
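A minimal sketch of such a line (the HDFS path is a placeholder; note that for textFile the second argument is technically a minimum number of partitions):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
# The second argument requests 5 partitions (a lower bound for textFile)
textFile = sc.textFile("hdfs://namenode/path/to/file.txt", 5)
print(textFile.getNumPartitions())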

How do I know if my table is partitioned in Spark?

You can use Scala's Try class to execute SHOW PARTITIONS on the table in question and derive a numPartitions value from the result. If the value is -1, the table is not partitioned.
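That suggestion is for Scala, but the same idea can be sketched in PySpark with a plain try/except (the table name is a placeholder, and the -1 convention follows the quoted answer):

from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()

def num_partitions(table):
    # SHOW PARTITIONS raises AnalysisException for a non-partitioned table
    try:
        return spark.sql(f"SHOW PARTITIONS {table}").count()
    except AnalysisException:
        return -1

# num_partitions("my_table") == -1 means the table is not partitioned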


1 Answer

I missed it: very simple:

rdd.getNumPartitions()

Not used to the java-ish getFooMethod() anymore ;)

Update: adding in the comment from @dnlbrky:

dataFrame.rdd.getNumPartitions()
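A quick end-to-end check (a sketch; the default partition count depends on your cluster configuration, so the explicit 4 keeps it deterministic):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(100), 4)  # explicitly request 4 partitions
print(rdd.getNumPartitions())  # -> 4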
answered Oct 13 '22 by WestCoastProjects