I am loading about 2 to 2.5 million records into a Postgres database every day.
I then read this data back with pd.read_sql into a dataframe, do some column manipulation and some minor merging, and save the modified data as a separate table for other people to use.
When I write the result back with pd.to_sql it takes forever. If I save a CSV file and use COPY FROM in Postgres, the whole thing only takes a few minutes, but the server is on a separate machine and it is a pain to transfer files there.
Using psycopg2, it looks like I can use copy_expert to benefit from the bulk copying while still staying in Python. If possible, I want to avoid writing an actual CSV file. Can I do this in memory with a pandas dataframe?
Here is an example of my pandas code. I would like to add copy_expert or something similar to make saving this data much faster, if possible.
    for date in required_date_range:
        df = pd.read_sql(sql=query, con=pg_engine, params={'x': date})
        # ... do stuff to the columns ...
        df.to_sql('table_name', pg_engine, index=False, if_exists='append',
                  dtype=final_table_dtypes)
Can someone help me with example code? I would prefer to keep using pandas, and it would be nice to do it in memory. If not, I will just write a temporary CSV file and do it that way.
Edit: here is my final code, which works. It takes only a couple of hundred seconds per date (millions of rows) instead of a couple of hours.
to_sql = """COPY %s FROM STDIN WITH CSV HEADER"""
def process_file(conn, table_name, file_object): fake_conn = cms_dtypes.pg_engine.raw_connection() fake_cur = fake_conn.cursor() fake_cur.copy_expert(sql=to_sql % table_name, file=file_object) fake_conn.commit() fake_cur.close() #after doing stuff to the dataframe s_buf = io.StringIO() df.to_csv(s_buf) process_file(cms_dtypes.pg_engine, 'fact_cms_employee', s_buf)
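As a side note, pandas 0.24 and later also accept a callable for the method argument of to_sql, so the COPY path can be wired directly into the to_sql call. The sketch below is adapted from the insertion-method example in the pandas documentation and assumes a psycopg2-backed SQLAlchemy engine; psql_insert_copy is just an illustrative name.

    import csv
    import io

    def psql_insert_copy(table, conn, keys, data_iter):
        # COPY-based insert method for df.to_sql, using the raw psycopg2 connection
        dbapi_conn = conn.connection
        with dbapi_conn.cursor() as cur:
            s_buf = io.StringIO()
            writer = csv.writer(s_buf)
            writer.writerows(data_iter)   # data_iter yields the rows to insert
            s_buf.seek(0)                 # rewind before COPY reads the buffer

            columns = ', '.join('"{}"'.format(k) for k in keys)
            if table.schema:
                table_name = '{}.{}'.format(table.schema, table.name)
            else:
                table_name = table.name
            sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns)
            cur.copy_expert(sql=sql, file=s_buf)

    # usage (hypothetical):
    # df.to_sql('table_name', pg_engine, index=False,
    #           if_exists='append', method=psql_insert_copy)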
The Python io module (docs) has the necessary tools for file-like objects.

    import io

    # text buffer
    s_buf = io.StringIO()

    # saving a data frame to a buffer (same as with a regular file):
    df.to_csv(s_buf)
Edit: (I forgot) in order to read from the buffer afterwards, its position should be set back to the beginning:
s_buf.seek(0)
I'm not familiar with psycopg2, but according to the docs both copy_expert and copy_from can be used, for example:

    cur.copy_from(s_buf, table)
(For Python 2, see StringIO.)
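Putting those pieces together, an end-to-end version of this approach might look roughly like the sketch below. The connection string and table name are placeholders, and the CSV is written without header or index so that a plain copy_from can load it directly.

    import io
    import pandas as pd
    import psycopg2

    df = pd.DataFrame({"name": ["foo", "bar"], "id": [1, 2]})  # example frame

    conn = psycopg2.connect("dbname=mydb user=me")   # placeholder connection string
    cur = conn.cursor()

    s_buf = io.StringIO()
    df.to_csv(s_buf, index=False, header=False)      # plain data rows only
    s_buf.seek(0)                                    # rewind before reading

    cur.copy_from(s_buf, 'my_table', sep=',')        # placeholder table name
    conn.commit()
    cur.close()
    conn.close()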
I had problems implementing the solution from ptrj.
I think the issue stems from pandas setting the pos of the buffer to the end.
See as follows:
    from StringIO import StringIO

    df = pd.DataFrame({"name": ['foo', 'bar'], "id": [1, 2]})
    s_buf = StringIO()
    df.to_csv(s_buf)
    s_buf.__dict__

    # Output
    # {'softspace': 0, 'buflist': ['foo,1\n', 'bar,2\n'],
    #  'pos': 12, 'len': 12, 'closed': False, 'buf': ''}
Notice that pos is at 12. I had to set pos to 0 in order for the subsequent copy_from command to work:
    s_buf.pos = 0
    cur = conn.cursor()
    cur.copy_from(s_buf, tablename, sep=',')
    conn.commit()
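Note that the pos attribute belongs to the old Python 2 StringIO.StringIO class; with Python 3's io.StringIO you would rewind the buffer with seek(0) instead. A minimal sketch mirroring the snippet above:

    import io

    s_buf = io.StringIO()
    df.to_csv(s_buf)
    s_buf.seek(0)   # io.StringIO has no pos attribute; rewind with seek()

    cur = conn.cursor()
    cur.copy_from(s_buf, tablename, sep=',')
    conn.commit()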