
Apache Spark: Get number of records per partition


I want to get information about each partition, such as the total number of records in each partition, on the driver side when the Spark job is submitted with deploy mode yarn cluster, so that I can log or print it on the console.

asked Sep 04 '17 by nilesh1212


People also ask

How many tasks does Spark run on each partition?

Spark assigns one task per partition and each worker can process one task at a time.
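
For reference, a quick way to see how many partitions (and therefore how many tasks per stage) a DataFrame has — a minimal sketch, assuming a DataFrame named df already exists:

// Number of partitions of the DataFrame, i.e. the number of tasks
// a stage operating on it will schedule.
val numPartitions = df.rdd.getNumPartitions
println(s"$numPartitions partitions -> $numPartitions tasks per stage")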


2 Answers

I'd use the built-in function; it should be as efficient as it gets:

import org.apache.spark.sql.functions.spark_partition_id

df.groupBy(spark_partition_id()).count()
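
If you need these counts in the driver logs when running with deploy mode yarn cluster (as the question asks), one option is to collect the small aggregated result to the driver and log it from there. This is only a sketch, assuming a DataFrame named df; the println can be replaced with whatever logger you use:

import org.apache.spark.sql.functions.spark_partition_id

// Collect the per-partition counts (a small result) to the driver,
// then print/log them there.
val partitionCounts = df
  .groupBy(spark_partition_id().as("partition_id"))
  .count()
  .collect()

partitionCounts.foreach { row =>
  // column 0: partition id (Int), column 1: record count (Long)
  println(s"partition ${row.getInt(0)} -> ${row.getLong(1)} records")
}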
answered Sep 25 '22 by Alper t. Turker


You can get the number of records per partition like this:

import spark.implicits._  // spark is your SparkSession; needed for .toDF on an RDD

df
  .rdd
  .mapPartitionsWithIndex { case (i, rows) => Iterator((i, rows.size)) }
  .toDF("partition_number", "number_of_records")
  .show()

But this will also launch a Spark job by itself, because Spark has to read the data to count the records.
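
Also note that with deploy mode yarn cluster, the output of .show (or any println) goes to the driver container's stdout on the cluster, not to the console of the machine that submitted the job; you can retrieve it with yarn logs -applicationId <application id> or through the YARN web UI.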

Spark may also be able to read Hive table statistics, but I don't know how to display that metadata.
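
For what it's worth, if the DataFrame is backed by a catalog/Hive table, table-level statistics can be computed and inspected with Spark SQL. This is only a sketch (the table name my_table is just an example), and it yields a total row count rather than a per-RDD-partition breakdown:

// Compute table-level statistics (row count, size in bytes) and store them
// in the metastore, then inspect them; look for the "Statistics" row.
spark.sql("ANALYZE TABLE my_table COMPUTE STATISTICS")
spark.sql("DESCRIBE EXTENDED my_table").show(100, false)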

answered Sep 26 '22 by Raphael Roth