I am importing a Postgres database into Spark. I know that I can partition on import, but that requires a numeric column (I don't want to use the value column because it's all over the place and doesn't maintain order):
df = spark.read.format('jdbc').options(url=url, dbtable='tableName', properties=properties).load()
df.printSchema()
root
|-- id: string (nullable = false)
|-- timestamp: timestamp (nullable = false)
|-- key: string (nullable = false)
|-- value: double (nullable = false)
Instead, I am converting the dataframe into an RDD (of enumerated tuples) and trying to partition that:
rdd = df.rdd.flatMap(lambda x: enumerate(x)).partitionBy(20)
Note that I used 20 because I have 5 workers with 4 cores each in my cluster, and 5*4=20.
Unfortunately, the following command still takes forever to execute:
result = rdd.first()
Therefore I am wondering whether my logic above makes sense. Am I doing anything wrong? From the web UI, it looks like the workers are not being used.
Apache Spark's Resilient Distributed Datasets (RDDs) are collections of data that can be too large to fit on a single node, so they are partitioned across multiple nodes. Spark automatically partitions RDDs and distributes the partitions across the cluster.
By default, one partition is created for each HDFS block of the file (64 MB in older HDFS versions, 128 MB in newer ones). However, when creating an RDD you can pass a second argument that defines the number of partitions, as in the sketch below.
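For example, a minimal sketch (the HDFS path is hypothetical):
textFile = sc.textFile("hdfs:///data/example.txt", 5)  # ask for at least 5 partitions
textFile.getNumPartitions()                            # 5 or more, depending on the input splits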
All RDDs offer two functions to change the number of partitions: repartition and coalesce. repartition can increase or decrease the number of partitions, while coalesce can only decrease it.
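For instance, a minimal sketch assuming an existing rdd:
rdd_more = rdd.repartition(40)  # full shuffle; can increase or decrease the partition count
rdd_less = rdd.coalesce(5)      # avoids a full shuffle; can only decrease it
rdd_more.getNumPartitions()     # 40
rdd_less.getNumPartitions()     # 5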
RDD data is spread across the RAM of the worker machines, but not every machine necessarily holds a partition of every RDD. Of course, an RDD only has data in memory once an action is performed on it, since RDDs are lazily evaluated.
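A small illustration of that laziness, assuming an existing rdd:
cached = rdd.cache()  # only marks the RDD for caching; nothing is computed yet
cached.count()        # the action triggers computation and materializes the partitions in memory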
Since you already know you can partition by a numeric column, this is probably what you should do. Here is the trick. First, let's find the minimum and maximum epoch:
url = ...
properties = ...

min_max_query = """(
    SELECT
        CAST(min(extract(epoch FROM timestamp)) AS bigint),
        CAST(max(extract(epoch FROM timestamp)) AS bigint)
    FROM tablename
) tmp"""

min_epoch, max_epoch = spark.read.jdbc(
    url=url, table=min_max_query, properties=properties
).first()
and use it to query the table:
numPartitions = ...

query = """(
    SELECT *, CAST(extract(epoch FROM timestamp) AS bigint) AS epoch
    FROM tablename
) AS tmp"""

spark.read.jdbc(
    url=url, table=query,
    lowerBound=min_epoch, upperBound=max_epoch + 1,
    column="epoch", numPartitions=numPartitions, properties=properties
).drop("epoch")
Since this splits data into ranges of the same size, it is relatively sensitive to data skew, so you should use it with caution.
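To verify the split and check for skew in practice, a quick sketch (capturing the result of the read above in a hypothetical variable named partitioned):
from pyspark.sql.functions import spark_partition_id

partitioned.rdd.getNumPartitions()                        # should equal numPartitions
partitioned.groupBy(spark_partition_id()).count().show()  # rows per partition; large imbalances indicate skew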
You could also provide a list of disjoint predicates as the predicates argument:
predicates = [
    "id BETWEEN 'a' AND 'c'",
    "id BETWEEN 'd' AND 'g'",
    ...  # continue until the predicates cover the full range and reach the desired number of partitions
]
spark.read.jdbc(
    url=url, table="tablename", properties=properties,
    predicates=predicates
)
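Each predicate becomes one partition of the resulting DataFrame; if you capture the result of the read above in a variable (say by_pred, a hypothetical name), you can verify that:
by_pred.rdd.getNumPartitions()  # equals len(predicates)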
The latter approach is much more flexible and can address certain issues with non-uniform data distribution, but it requires more knowledge about the data.
Using partitionBy fetches the data first and then performs a full shuffle to get the desired number of partitions, so it is relatively expensive.
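As a rough sketch of what that route implies (the choice of id as the key here is hypothetical, purely for illustration):
# df is the unpartitioned JDBC DataFrame from the question
pairs = df.rdd.map(lambda row: (row.id, row))  # key-value pairs; partitionBy needs a pair RDD
shuffled = pairs.partitionBy(20)               # full shuffle on top of the single-partition JDBC read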