The PySpark RDD documentation
http://spark.apache.org/docs/1.2.1/api/python/pyspark.html#pyspark.RDD
does not show any method(s) to display partition information for an RDD.
Is there any way to get that information without executing an additional step, e.g.:
myrdd.mapPartitions(lambda x: iter([1])).sum()
The above does work, but it seems like extra effort.
In PySpark you can get the current number of partitions by calling getNumPartitions() on the RDD; to use it with a DataFrame, you first need to convert the DataFrame to an RDD.
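For instance, a minimal sketch (assuming a live SparkContext sc and a SparkSession spark; the data and partition counts are only illustrative):
rdd = sc.parallelize(range(100), 4)
rdd.getNumPartitions()       # 4
df = spark.createDataFrame([(1, "a"), (2, "b")])
df.rdd.getNumPartitions()    # depends on the default parallelism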
As already mentioned above, one partition is created for each block of the file in HDFS (64 MB by default). However, when creating an RDD you can pass a second argument that defines the number of partitions to create. A call like the one sketched below creates an RDD named textFile with 5 partitions.
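The original line of code is not reproduced here; it presumably looks something like this (the HDFS path is a placeholder):
textFile = sc.textFile("hdfs:///path/to/somefile.txt", 5)
textFile.getNumPartitions()   # 5 or more, depending on the input splits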
You can use Scala's Try class and execute SHOW PARTITIONS on the required table, then check numPartitions. If the value is -1, the table is not partitioned.
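A rough PySpark sketch of the same idea (the answer itself uses Scala's Try; the helper name and table name here are placeholders):
from pyspark.sql.utils import AnalysisException

def num_table_partitions(spark, table_name):
    try:
        # SHOW PARTITIONS raises AnalysisException if the table is not partitioned
        return spark.sql("SHOW PARTITIONS {}".format(table_name)).count()
    except AnalysisException:
        return -1

num_table_partitions(spark, "my_table")   # -1 means "my_table" is not partitioned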
I missed it: very simple:
rdd.getNumPartitions()
Not used to the java-ish getFooMethod() anymore ;)
Update: adding in the comment from @dnlbrky:
dataFrame.rdd.getNumPartitions()