Looking for some info on using a custom partitioner in PySpark. I have a DataFrame holding country data for various countries. If I repartition on the country column, the data is distributed into n partitions with each country's data kept in specific partitions. This creates skewed partitions, which I can see using the glom() method.
Some countries like USA and CHN have a huge amount of data in this DataFrame. I want to repartition my DataFrame such that for USA and CHN the data is further split into some 10 partitions, while the partitions stay as they are for the other countries like IND, THA, AUS etc. Can we extend the Partitioner class in PySpark code?
I have read in the link below that we can extend the Scala Partitioner class in a Scala Spark application and modify the partitioner class to use custom logic to repartition our data based on requirements, like the one I have. Please help me achieve this in PySpark. See the link below: What is an efficient way to partition by column but maintain a fixed partition count?
I am using Spark version 2.3.0.2 and below is my DataFrame structure:
datadf = spark.sql("""
    SELECT
        ID_NUMBER, SENDER_NAME, SENDER_ADDRESS, REGION_CODE, COUNTRY_CODE
    FROM udb.sometable
""")
The incoming data has records for six countries: AUS, IND, THA, RUS, CHN and USA. CHN and USA have skewed data, so when I repartition on COUNTRY_CODE, two partitions contain a lot of data whereas the others are fine. I checked this using the glom() method.
newdf = datadf.repartition("COUNTRY_CODE")
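This is roughly how I checked the partition sizes (glom() gathers each partition into a list, so I only collect the per-partition counts):

print(newdf.rdd.glom().map(len).collect())  # record count per partition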
from pyspark.sql import SparkSession
from pyspark.sql import HiveContext, DataFrameWriter, DataFrame
newDF = datadf.repartitionByRange(3,"COUNTRY_CODE","USA")
I was trying to repartition my data into 3 more partitions for countries USA and CHN only, and would like to keep the data for the other countries in a single partition each.
This is what I am expecting
AUS- one partition
IND- one partition
THA- one partition
RUS- one partition
CHN- three partitions
USA- three partitions
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 1182, in __getattr__
    "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'repartitionByRange'
PySpark's partitionBy() is used to partition data based on column values while writing a DataFrame to disk/file system. When you write a DataFrame to disk by calling partitionBy(), PySpark splits the records based on the partition column and stores each partition's data in its own sub-directory.
Custom partitioning lets you alter the size and number of partitions to fit your application's needs: you define which key should enter which partition. You provide an explicit partitioner by calling the partitionBy method on a paired RDD.
A loaded RDD is partitioned by the default partitioner: hash code. To specify a custom partitioner, call rdd.partitionBy() with your own partition function.
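A minimal sketch of that against the question's datadf (the explicit country-to-partition mapping is illustrative, not required):

# key the rows by country so partitionBy sees (key, value) pairs
pairs = datadf.rdd.map(lambda row: (row["COUNTRY_CODE"], row))

# explicit mapping instead of the default hash; PySpark applies the
# partition function's result modulo the partition count
COUNTRY_IDS = {"AUS": 0, "IND": 1, "THA": 2, "RUS": 3, "CHN": 4, "USA": 5}

def country_partitioner(key):
    return COUNTRY_IDS.get(key, 0)

# note: the result is an RDD of (country, Row) pairs, not a DataFrame
partitioned = pairs.partitionBy(6, country_partitioner)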
Partition in memory: you can partition or repartition a DataFrame by calling the repartition() or coalesce() transformations. Partition on disk: while writing the PySpark DataFrame back to disk, you can choose how to partition the data based on columns using partitionBy() of pyspark.sql.DataFrameWriter.
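Side by side on the question's datadf (the output path is just an example):

# partition in memory: shuffle rows into 6 partitions by column value
inmem = datadf.repartition(6, "COUNTRY_CODE")

# partition on disk: one sub-directory per COUNTRY_CODE value
datadf.write.partitionBy("COUNTRY_CODE").mode("overwrite").parquet("/tmp/by_country")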
Try something like this with hashing:
newDf = oldDf.repartition(N, "col1", "coln")
or for the ranging approach (the PySpark method was only added in Spark 2.4, which explains the AttributeError you got on 2.3):
newDF = oldDF.repartitionByRange(N, "col1", "coln")
There is no custom partitioning for DataFrames just yet.
In your case I would go for hashing, but there are no guarantees.
If your data is skewed you may need some extra work; the simplest approach is to use two columns for partitioning. E.g. an existing or new column, in this case a column that applies a grouping within a given country, say 1 .. N, and then partition on the two columns together.
For countries with many records you get N synthetic sub-divisions; for the others with low cardinality, only one such group number. Not too hard. Both forms of partitioning can take more than one column.
In my view, uniformly filling partitions takes a lot of effort and is not really attainable, but a next-best approach such as this one can suffice well enough. It amounts to custom partitioning to an extent.
Otherwise, using .withColumn on a DataFrame you can simulate custom partitioning with those rules, filling a new column and then applying repartitionByRange (or plain repartition). Also not so hard.
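A minimal sketch of that, assuming 3 sub-groups for USA and CHN and 1 for every other country; it uses hash repartition on the two columns, since repartitionByRange is not available on your Spark 2.3:

from pyspark.sql import functions as F

SKEWED = ["USA", "CHN"]
N = 3  # synthetic sub-divisions per skewed country

salted = datadf.withColumn(
    "PART_GROUP",
    F.when(F.col("COUNTRY_CODE").isin(SKEWED),
           (F.rand() * N).cast("int"))   # 0 .. N-1 for skewed countries
     .otherwise(F.lit(0)))               # single group for the rest

# 4 countries x 1 group + 2 countries x 3 groups = 10 target partitions
newdf = salted.repartition(10, "COUNTRY_CODE", "PART_GROUP")

As said, there are no guarantees: hashing can still co-locate two (country, group) pairs in one partition, but the skewed countries are no longer forced into a single partition each.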