I have created two DataFrames in PySpark from my Hive tables as follows:
data1 = spark.sql("""
SELECT ID, MODEL_NUMBER, MODEL_YEAR, COUNTRY_CODE
FROM MODEL_TABLE1 WHERE COUNTRY_CODE IN ('IND','CHN','USA','RUS','AUS')
""")
Each country has millions of unique IDs in alphanumeric format.
data2 = spark.sql("""
SELECT ID, MODEL_NUMBER, MODEL_YEAR, COUNTRY_CODE
FROM MODEL_TABLE2 WHERE COUNTRY_CODE IN ('IND','CHN')
""")
I want to join these two DataFrames on the ID column. How can I repartition the data so that it gets distributed uniformly across the partitions? Can I use the following to repartition my data?
newdf1 = data1.repartition(100, "ID")
newdf2 = data2.repartition(100, "ID")
What would be the best way to partition the data so that the join runs faster?
In Spark, a partition is an atomic chunk of data: a logical division of the dataset stored on a node of the cluster. Partitions are the basic unit of parallelism, and an RDD is a collection of partitions. In PySpark you can check how many partitions a DataFrame currently has by converting it to its underlying RDD and calling getNumPartitions().
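For example, a minimal check on the two DataFrames from the question (assuming data1 and data2 from above are in scope):

# Inspect the current number of partitions via the underlying RDD.
data1.rdd.getNumPartitions()
data2.rdd.getNumPartitions()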
As far as I know, your approach of repartitioning on the ID column is correct. Consider the following proof of concept, which uses spark_partition_id() to get the corresponding partition id:
import pandas as pd
import numpy as np
from pyspark.sql.functions import spark_partition_id

def create_dummy_data():
    # Ten rows: an integer "id" in [0, 5) plus a random value.
    data = np.vstack([np.random.randint(0, 5, size=10),
                      np.random.random(10)])
    df = pd.DataFrame(data.T, columns=["id", "values"])
    return spark.createDataFrame(df)

def show_partition_id(df):
    """Helper function to show each row with the id of the partition it lives on."""
    return df.select(*df.columns, spark_partition_id().alias("pid")).show()
df1 = create_dummy_data()
df2 = create_dummy_data()
show_partition_id(df1)
+---+-------------------+---+
| id| values|pid|
+---+-------------------+---+
|1.0| 0.6051170383675885| 0|
|3.0| 0.4613520717857513| 0|
|0.0| 0.797734780966592| 1|
|2.0|0.35594664760134587| 1|
|2.0|0.08223203758144915| 2|
|0.0| 0.3112880092048709| 2|
|4.0| 0.2689639324292137| 3|
|1.0| 0.6466782159542134| 3|
|0.0| 0.8340472796153436| 3|
|4.0| 0.8054752411745659| 3|
+---+-------------------+---+
show_partition_id(df2)
+---+-------------------+---+
| id| values|pid|
+---+-------------------+---+
|4.0| 0.8950517294190533| 0|
|3.0| 0.4084717827425539| 0|
|3.0| 0.798146627431009| 1|
|4.0| 0.8039931522181247| 1|
|3.0| 0.732125135531736| 2|
|0.0| 0.536328329270619| 2|
|1.0|0.25952064363007576| 3|
|2.0| 0.1958334111199559| 3|
|0.0| 0.728098753644471| 3|
|0.0| 0.9825387111807906| 3|
+---+-------------------+---+
show_partition_id(df1.repartition(2, "id"))
+---+-------------------+---+
| id| values|pid|
+---+-------------------+---+
|1.0| 0.6051170383675885| 0|
|3.0| 0.4613520717857513| 0|
|4.0| 0.2689639324292137| 0|
|1.0| 0.6466782159542134| 0|
|4.0| 0.8054752411745659| 0|
|0.0| 0.797734780966592| 1|
|2.0|0.35594664760134587| 1|
|2.0|0.08223203758144915| 1|
|0.0| 0.3112880092048709| 1|
|0.0| 0.8340472796153436| 1|
+---+-------------------+---+
show_partition_id(df2.repartition(2, "id"))
+---+-------------------+---+
| id| values|pid|
+---+-------------------+---+
|4.0| 0.8950517294190533| 0|
|3.0| 0.4084717827425539| 0|
|3.0| 0.798146627431009| 0|
|4.0| 0.8039931522181247| 0|
|3.0| 0.732125135531736| 0|
|1.0|0.25952064363007576| 0|
|0.0| 0.536328329270619| 1|
|2.0| 0.1958334111199559| 1|
|0.0| 0.728098753644471| 1|
|0.0| 0.9825387111807906| 1|
+---+-------------------+---+
After repartitioning, ids 0 and 2 are located on the same partition and ids 1, 3, and 4 on the other, consistently in both DataFrames, so rows with matching keys are already co-located for the join.
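Applied back to the question, a minimal sketch of the join would look like this (table and column names taken from the question; the partition count of 100 comes from the asker's snippet and is an assumption to tune to your actual data volume):

# Repartition both DataFrames on the join key so matching IDs are co-located.
newdf1 = data1.repartition(100, "ID")
newdf2 = data2.repartition(100, "ID")
# Join on the common ID column.
joined = newdf1.join(newdf2, on="ID", how="inner")
joined.show()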