What would be the most efficient way to insert millions of records, say 50 million, from a Spark DataFrame into Postgres tables? I have done this from Spark to MSSQL in the past by making use of the bulk copy and batch size options, and that worked well.
Is there something similar that can be done here for Postgres?
Adding the code I have tried and the time it took to run the process:
from timeit import default_timer as timer
from datetime import timedelta

def inserter():
    start = timer()
    # JDBC write with 5 parallel connections and a batch size of 200k rows
    sql_res.write.format("jdbc") \
        .option("numPartitions", "5").option("batchsize", "200000") \
        .option("url", "jdbc:postgresql://xyz.com:5435/abc_db") \
        .option("dbtable", "public.full_load") \
        .option("user", "root").option("password", "password") \
        .save()
    end = timer()
    print(timedelta(seconds=end - start))

inserter()
So I ran the above approach for 10 million records, with 5 parallel connections as specified in numPartitions, and also tried a batch size of 200k.
The total time the process took was 0:14:05.760926 (fourteen minutes and five seconds).
Is there any other efficient approach which would reduce the time?
What would be the efficient or optimal batch size I can use? Will increasing my batch size do the job quicker? Or will opening multiple connections, i.e. > 5, help make the process quicker?
On average, 14 minutes for 10 million records is not bad, but I'm looking for people out there who have done this before and can help answer this question.
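For context, by "a larger batch size or more connections" I mean something like the following variation of the same write (the values 10 and 500000 are just examples I am considering, not measured optima):

# Same write as above, but with more parallel connections and a larger
# batch size, to compare against the 5-connection / 200k baseline.
sql_res.repartition(10).write.format("jdbc") \
    .option("numPartitions", "10") \
    .option("batchsize", "500000") \
    .option("url", "jdbc:postgresql://xyz.com:5435/abc_db") \
    .option("dbtable", "public.full_load") \
    .option("user", "root") \
    .option("password", "password") \
    .save()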
How large a cluster can Spark scale to? Many organizations run Spark on clusters of thousands of nodes; the largest cluster we know of has 8,000 nodes. In terms of data size, Spark has been shown to work well up to petabytes.
When using Postgres, if you need writes exceeding tens of thousands of INSERTs per second, we turn to the Postgres COPY utility for bulk loading. COPY is capable of handling hundreds of thousands of writes per second.
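A rough sketch of what a COPY-based path from Spark might look like (assuming psycopg2 is installed on every executor and reusing the connection details from the question; the helper copy_partition is hypothetical):

import csv
import io
import psycopg2  # assumed to be installed on every executor

def copy_partition(rows):
    # Buffer this partition as CSV in memory and stream it through COPY
    # (NULL handling and quoting details omitted for brevity).
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow(row)
    buf.seek(0)
    conn = psycopg2.connect(host="xyz.com", port=5435, dbname="abc_db",
                            user="root", password="password")
    try:
        with conn.cursor() as cur:
            cur.copy_expert("COPY public.full_load FROM STDIN WITH CSV", buf)
        conn.commit()
    finally:
        conn.close()

# One COPY stream per partition, so the partition count bounds the
# number of concurrent connections to the server.
sql_res.foreachPartition(copy_partition)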
I actually did kind of the same work a while ago but using Apache Sqoop.
I would say that to answer this question we have to try to optimize the communication between Spark and PostgreSQL, specifically the data flowing from Spark to PostgreSQL.
But be careful, and do not forget the Spark side: it does not make sense to use mapPartitions if the number of partitions is too high compared with the maximum number of connections PostgreSQL supports. If you have too many partitions and you open a connection for each one, you will probably hit the following error: org.postgresql.util.PSQLException: FATAL: sorry, too many clients already.
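To make that check concrete, here is a small sketch (assuming psycopg2 on the driver and the question's connection details) that compares the DataFrame's partition count with the server's max_connections before writing:

import psycopg2  # assumed available on the driver

# One JDBC connection is opened per partition, so this is the write concurrency
num_partitions = sql_res.rdd.getNumPartitions()

# Ask the server how many client connections it allows in total
conn = psycopg2.connect(host="xyz.com", port=5435, dbname="abc_db",
                        user="root", password="password")
with conn.cursor() as cur:
    cur.execute("SHOW max_connections;")
    max_connections = int(cur.fetchone()[0])
conn.close()

if num_partitions >= max_connections:
    print("%d partitions vs max_connections=%d: coalesce first or expect "
          "'sorry, too many clients already'" % (num_partitions, max_connections))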
In order to tune the insertion process, I would approach the problem by adjusting the number of partitions with coalesce, as mentioned here (a sketch follows below). Finally, there is no silver bullet for this job. You can use all the tips I mentioned above, but it will really depend on your data and use cases.
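A minimal sketch of that coalesce step, reusing the question's JDBC options (the target of 8 partitions is an assumption, not a recommendation):

# Shrink to a partition count the server can comfortably handle;
# coalesce avoids the full shuffle that repartition would trigger.
(sql_res.coalesce(8)
    .write.format("jdbc")
    .option("numPartitions", "8")
    .option("batchsize", "200000")
    .option("url", "jdbc:postgresql://xyz.com:5435/abc_db")
    .option("dbtable", "public.full_load")
    .option("user", "root")
    .option("password", "password")
    .save())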