I have created two DataFrames in PySpark from my Hive tables as follows:
data1 = spark.sql("""
SELECT ID, MODEL_NUMBER, MODEL_YEAR, COUNTRY_CODE
FROM MODEL_TABLE1 WHERE COUNTRY_CODE IN ('IND','CHN','USA','RUS','AUS')
""")
Each country has millions of unique IDs in alphanumeric format.
data2 = spark.sql("""
SELECT ID, MODEL_NUMBER, MODEL_YEAR, COUNTRY_CODE
FROM MODEL_TABLE2 WHERE COUNTRY_CODE IN ('IND','CHN')
""")
I want to join these two DataFrames on the ID column. How can I repartition the data so that it gets distributed uniformly across the partitions? Can I use the following to repartition my data?
newdf1 = data1.repartition(100, "ID")
newdf2 = data2.repartition(100, "ID")
What would be the best way to partition the data so that the join runs faster?
In Spark, a partition is an atomic chunk of data: a logical division of the dataset stored on a node of the cluster. Partitions are the basic unit of parallelism, and an RDD is a collection of partitions. In PySpark you can check how many partitions a DataFrame currently has by converting it to its underlying RDD and calling getNumPartitions().
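For example, a minimal check on the two DataFrames from the question (assuming data1 and data2 from above are in scope):

# Inspect the current number of partitions via the underlying RDD.
data1.rdd.getNumPartitions()
data2.rdd.getNumPartitions()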
As far as I know, your approach of repartitioning on the ID column is correct. Consider the following proof of concept, which uses spark_partition_id() to get the corresponding partition id:
import pandas as pd
import numpy as np
from pyspark.sql.functions import spark_partition_id

def create_dummy_data():
    # Ten rows: an integer "id" in [0, 5) plus a random value.
    data = np.vstack([np.random.randint(0, 5, size=10),
                      np.random.random(10)])
    df = pd.DataFrame(data.T, columns=["id", "values"])
    return spark.createDataFrame(df)

def show_partition_id(df):
    """Helper function to show each row with the id of the partition it lives on."""
    return df.select(*df.columns, spark_partition_id().alias("pid")).show()
df1 = create_dummy_data()
df2 = create_dummy_data()
show_partition_id(df1)
+---+-------------------+---+
| id| values|pid|
+---+-------------------+---+
|1.0| 0.6051170383675885| 0|
|3.0| 0.4613520717857513| 0|
|0.0| 0.797734780966592| 1|
|2.0|0.35594664760134587| 1|
|2.0|0.08223203758144915| 2|
|0.0| 0.3112880092048709| 2|
|4.0| 0.2689639324292137| 3|
|1.0| 0.6466782159542134| 3|
|0.0| 0.8340472796153436| 3|
|4.0| 0.8054752411745659| 3|
+---+-------------------+---+
show_partition_id(df2)
+---+-------------------+---+
| id| values|pid|
+---+-------------------+---+
|4.0| 0.8950517294190533| 0|
|3.0| 0.4084717827425539| 0|
|3.0| 0.798146627431009| 1|
|4.0| 0.8039931522181247| 1|
|3.0| 0.732125135531736| 2|
|0.0| 0.536328329270619| 2|
|1.0|0.25952064363007576| 3|
|2.0| 0.1958334111199559| 3|
|0.0| 0.728098753644471| 3|
|0.0| 0.9825387111807906| 3|
+---+-------------------+---+
show_partition_id(df1.repartition(2, "id"))
+---+-------------------+---+
| id| values|pid|
+---+-------------------+---+
|1.0| 0.6051170383675885| 0|
|3.0| 0.4613520717857513| 0|
|4.0| 0.2689639324292137| 0|
|1.0| 0.6466782159542134| 0|
|4.0| 0.8054752411745659| 0|
|0.0| 0.797734780966592| 1|
|2.0|0.35594664760134587| 1|
|2.0|0.08223203758144915| 1|
|0.0| 0.3112880092048709| 1|
|0.0| 0.8340472796153436| 1|
+---+-------------------+---+
show_partition_id(df2.repartition(2, "id"))
+---+-------------------+---+
| id| values|pid|
+---+-------------------+---+
|4.0| 0.8950517294190533| 0|
|3.0| 0.4084717827425539| 0|
|3.0| 0.798146627431009| 0|
|4.0| 0.8039931522181247| 0|
|3.0| 0.732125135531736| 0|
|1.0|0.25952064363007576| 0|
|0.0| 0.536328329270619| 1|
|2.0| 0.1958334111199559| 1|
|0.0| 0.728098753644471| 1|
|0.0| 0.9825387111807906| 1|
+---+-------------------+---+
After repartitioning, ids 0 and 2 are located on the same partition and ids 1, 3, and 4 on the other, consistently in both DataFrames, so rows with matching keys are already co-located for the join.
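Applied back to the question, a minimal sketch of the join would look like this (table and column names taken from the question; the partition count of 100 comes from the asker's snippet and is an assumption to tune to your actual data volume):

# Repartition both DataFrames on the join key so matching IDs are co-located.
newdf1 = data1.repartition(100, "ID")
newdf2 = data2.repartition(100, "ID")
# Join on the common ID column.
joined = newdf1.join(newdf2, on="ID", how="inner")
joined.show()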