
pandas .to_sql timing out with RDS

I have a 22 million row .csv file (~850mb) that I am trying to load into a postgres db on Amazon RDS. It fails every time with a time-out error, even when I split the file into smaller parts (100,000 rows each) and even when I use chunksize.

All I am doing at the moment is loading the .csv into a dataframe and then writing it to the db with df.to_sql(table_name, engine, index=False, if_exists='append', chunksize=1000).

I am using create_engine from sqlalchemy to create the connection: engine = create_engine('postgresql:database_info')
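Put together, the relevant code looks roughly like this (the connection string and table name below are placeholders for the real RDS details):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string - the real one points at the RDS instance
engine = create_engine('postgresql://user:password@my-rds-host:5432/mydb')

df = pd.read_csv('data.csv')
df.to_sql('my_table', engine, index=False, if_exists='append', chunksize=1000)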

I have tested writing smaller amounts of data with psycopg2 without a problem, but it takes around 50 seconds to write 1000 rows. Obviously for 22m rows that won't work.
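The psycopg2 test was a plain row-by-row insert along these lines (table and column names are placeholders):

import psycopg2

conn = psycopg2.connect('dbname=mydb user=user password=secret host=my-rds-host')
cur = conn.cursor()
# rows is a list of tuples read from the csv
for row in rows:
    cur.execute('INSERT INTO my_table (col_a, col_b, col_c) VALUES (%s, %s, %s)', row)
conn.commit()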

Is there anything else I can try?

asked May 17 '15 by e h


2 Answers

The pandas DataFrame.to_sql() method is not really designed for large inserts, since it does not use the PostgreSQL COPY command. Regular SQL queries can time out; that is not the fault of pandas, it is controlled by the database server, but it can be modified per connection. See the PostgreSQL documentation for the statement_timeout setting.
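If you just want the insert to survive, you can raise or disable statement_timeout for your own connections. A rough sketch with SQLAlchemy (connection details are placeholders):

from sqlalchemy import create_engine

# '-c statement_timeout=0' disables the per-statement timeout for every connection
# this engine opens; use a finite value (in milliseconds) if you prefer a limit
engine = create_engine(
    'postgresql://user:password@my-rds-host:5432/mydb',
    connect_args={'options': '-c statement_timeout=0'}
)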

What I would recommend is to consider using Redshift, which is optimized for data warehousing and can read huge data dumps directly from S3 buckets using the Redshift COPY command.
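A Redshift load straight from S3 looks roughly like this (the bucket, table name, and IAM role below are placeholders), run through a psycopg2 connection to the cluster or any SQL client:

copy_sql = """
    COPY my_table
    FROM 's3://my-bucket/data.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    CSV
    IGNOREHEADER 1
"""
# cursor and connection come from a psycopg2 connection to the Redshift cluster
cursor.execute(copy_sql)
connection.commit()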

If you are in no position to use Redshift, I would still recommend finding a way to do this operation using the PostgreSQL COPY command, since it was invented to circumvent exactly the problem you are experiencing.
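For example, with psycopg2 you can stream the csv from your machine straight into the table, bypassing pandas entirely (connection details, table name, and csv layout are assumptions):

import psycopg2

conn = psycopg2.connect('dbname=mydb user=user password=secret host=my-rds-host')
cur = conn.cursor()
with open('data.csv') as f:
    # COPY ... FROM STDIN reads the file on the client side, which is what you need
    # against RDS, where you cannot place files on the server itself
    cur.copy_expert("COPY my_table FROM STDIN WITH (FORMAT csv, HEADER true)", f)
conn.commit()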

answered Nov 17 '22 by firelynx


You can write the dataframe to an in-memory cStringIO buffer and then load it into the database using the copy_from method in psycopg2, which uses the PostgreSQL COPY command that @firelynx mentions.

import cStringIO

# Convert the dataframe into a list of row dictionaries
rows = output.T.to_dict().values()

# Build a tab-separated, newline-delimited buffer that COPY can consume
dboutput = cStringIO.StringIO()
dboutput.write('\n'.join(['\t'.join([row['1_str'],
                                     row['2_str'],
                                     str(row['3_float'])]) for row in rows]))
dboutput.seek(0)

# Stream the buffer into the table via PostgreSQL COPY
cursor.copy_from(dboutput, 'TABLE_NAME')
connection.commit()

where output is a pandas dataframe with the columns 1_str, 2_str and 3_float that you want to write to the database, and cursor and connection come from an open psycopg2 connection.
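On Python 3, where cStringIO no longer exists, the same idea can be written with io.StringIO and pandas' own to_csv (output, cursor and connection as above):

import io

buf = io.StringIO()
# Let pandas serialise the frame as tab-separated text, which copy_from expects by default
output.to_csv(buf, sep='\t', header=False, index=False)
buf.seek(0)
cursor.copy_from(buf, 'TABLE_NAME')
connection.commit()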

answered Nov 17 '22 by RoachLord