
Writing more than 50 million records from a PySpark DataFrame to PostgreSQL: most efficient approach

What would be the most efficient way to insert millions of records, say 50 million, from a Spark DataFrame into Postgres tables? I have done this from Spark to MSSQL in the past by making use of the bulk copy and batch size options, which was successful.

Is there something similar that can be done here for Postgres?

Here is the code I tried, and the time it took to run the process:

from timeit import default_timer as timer
from datetime import timedelta

def inserter():
    start = timer()
    # JDBC write with 5 parallel connections (numPartitions) and
    # 200k rows per batch round trip (batchsize).
    sql_res.write.format("jdbc") \
        .option("numPartitions", "5").option("batchsize", "200000") \
        .option("url", "jdbc:postgresql://xyz.com:5435/abc_db") \
        .option("dbtable", "public.full_load") \
        .option("user", "root").option("password", "password") \
        .save()
    end = timer()
    print(timedelta(seconds=end - start))

inserter()

So I used the above approach for 10 million records, with 5 parallel connections as specified in numPartitions, and also tried a batch size of 200k.

The total time for the process was 0:14:05.760926 (fourteen minutes and five seconds).

Is there any other efficient approach that would reduce the time?

What would be an efficient or optimal batch size? Will increasing the batch size speed up the job, or would opening more connections (i.e. > 5) make the process quicker?

On average, 14 minutes for 10 million records is not bad, but I am looking for people who have done this before to help answer this question.

Chetan_Vasudevan asked Dec 10 '19


1 Answer

I actually did kind of the same work a while ago but using Apache Sqoop.

I would say that to answer this question we have to try to optimize the communication between Spark and PostgreSQL, specifically the data flowing from Spark to PostgreSQL.
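For the JDBC path itself, one knob worth trying is the pgjdbc driver's reWriteBatchedInserts URL parameter, which rewrites batched single-row INSERTs into multi-row INSERTs on the driver side. A minimal sketch, reusing the connection details from the question; the numPartitions and batchsize values here are illustrative placeholders, not recommendations:

# Sketch only: reWriteBatchedInserts is a pgjdbc connection parameter;
# the numPartitions/batchsize values below are placeholders to tune per workload.
(sql_res.write.format("jdbc")
    .option("url", "jdbc:postgresql://xyz.com:5435/abc_db?reWriteBatchedInserts=true")
    .option("dbtable", "public.full_load")
    .option("user", "root")
    .option("password", "password")
    .option("numPartitions", "8")
    .option("batchsize", "100000")
    .mode("append")
    .save())

Whether this helps depends on the driver version and the workload, so it is worth benchmarking against the plain write from the question.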

But be careful: do not forget the Spark side. It does not make sense to execute mapPartitions if the number of partitions is too high compared with the maximum number of connections PostgreSQL supports. If you have too many partitions and open a connection for each one, you will probably get the following error: org.postgresql.util.PSQLException: FATAL: sorry, too many clients already.
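As a quick sanity check on that point (a sketch; the connection budget of 8 is an assumed value, not something from the question), you can compare the DataFrame's partition count with the number of connections you are willing to open before writing:

# Sketch: keep write parallelism below the server's connection budget.
# max_parallel_connections is assumed; derive it from max_connections on
# your PostgreSQL instance, leaving headroom for other clients.
max_parallel_connections = 8

if sql_res.rdd.getNumPartitions() > max_parallel_connections:
    # coalesce reduces the partition count without a full shuffle
    sql_res = sql_res.coalesce(max_parallel_connections)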

In order to tune the insertion process, I would approach the problem with the following steps:

  • Remember that the number of partitions is important. Check the number of partitions and then adjust it based on the number of parallel connections you want to have. You might want one connection per partition, so I would suggest checking coalesce, as mentioned here.
  • Check the maximum number of connections your PostgreSQL instance supports, and increase it if you want more parallel writers.
  • For inserting data into PostgreSQL, the COPY command is recommended; see the sketch after this list. Here is also a more elaborate answer about how to speed up PostgreSQL insertion.
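A minimal sketch of the COPY route, assuming psycopg2 is installed on the executors and reusing the table and connection details from the question (NULL and escaping handling is simplified; this is an illustration, not a drop-in replacement):

import csv
import io

import psycopg2

def copy_partition(rows):
    # Buffer the partition as CSV in memory and bulk-load it with COPY.
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow(row)
    buf.seek(0)
    conn = psycopg2.connect(host="xyz.com", port=5435, dbname="abc_db",
                            user="root", password="password")
    try:
        with conn.cursor() as cur:
            cur.copy_expert("COPY public.full_load FROM STDIN WITH (FORMAT csv)", buf)
        conn.commit()
    finally:
        conn.close()

# One COPY per partition; keep the partition count within the connection budget.
sql_res.coalesce(8).foreachPartition(copy_partition)

Each partition opens one connection and issues a single COPY, which is usually much faster than row-by-row INSERTs, at the cost of handling the CSV formatting yourself.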

Finally, there is no silver bullet for this job. You can use all the tips I mentioned above, but it will really depend on your data and use cases.

dbustosp answered Oct 05 '22