I am loading about 2 to 2.5 million records into a Postgres database every day.
I then read this data back with pd.read_sql into a dataframe, do some column manipulation and some minor merging, and save the modified data as a separate table for other people to use.
When I write the result back with pd.to_sql it takes forever. If I save a CSV file and use COPY FROM in Postgres, the whole thing only takes a few minutes, but the server is on a separate machine and it is a pain to transfer files there.
Using psycopg2, it looks like I can use copy_expert to benefit from the bulk copying while still staying in Python. If possible, I want to avoid writing an actual CSV file. Can I do this in memory with a pandas dataframe?
Here is an example of my pandas code. I would like to add copy_expert or something similar to make saving this data much faster, if possible.
    for date in required_date_range:
        df = pd.read_sql(sql=query, con=pg_engine, params={'x': date})
        # ... do stuff to the columns ...
        df.to_sql('table_name', pg_engine, index=False, if_exists='append',
                  dtype=final_table_dtypes)
Can someone help me with example code? I would prefer to keep using pandas, and it would be nice to do it in memory. If not, I will just write a temporary CSV file and do it that way.
Edit: here is my final code, which works. It takes only a couple of hundred seconds per date (millions of rows) instead of a couple of hours.
to_sql = """COPY %s FROM STDIN WITH CSV HEADER"""
def process_file(conn, table_name, file_object): fake_conn = cms_dtypes.pg_engine.raw_connection() fake_cur = fake_conn.cursor() fake_cur.copy_expert(sql=to_sql % table_name, file=file_object) fake_conn.commit() fake_cur.close() #after doing stuff to the dataframe s_buf = io.StringIO() df.to_csv(s_buf) process_file(cms_dtypes.pg_engine, 'fact_cms_employee', s_buf)
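As a side note, pandas 0.24 and later also accept a callable for the method argument of to_sql, so the COPY path can be wired directly into the to_sql call. The sketch below is adapted from the insertion-method example in the pandas documentation and assumes a psycopg2-backed SQLAlchemy engine; psql_insert_copy is just an illustrative name.

    import csv
    import io

    def psql_insert_copy(table, conn, keys, data_iter):
        # COPY-based insert method for df.to_sql, using the raw psycopg2 connection
        dbapi_conn = conn.connection
        with dbapi_conn.cursor() as cur:
            s_buf = io.StringIO()
            writer = csv.writer(s_buf)
            writer.writerows(data_iter)   # data_iter yields the rows to insert
            s_buf.seek(0)                 # rewind before COPY reads the buffer

            columns = ', '.join('"{}"'.format(k) for k in keys)
            if table.schema:
                table_name = '{}.{}'.format(table.schema, table.name)
            else:
                table_name = table.name
            sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns)
            cur.copy_expert(sql=sql, file=s_buf)

    # usage (hypothetical):
    # df.to_sql('table_name', pg_engine, index=False,
    #           if_exists='append', method=psql_insert_copy)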
The Python io module (docs) has the necessary tools for file-like objects.

    import io

    # text buffer
    s_buf = io.StringIO()

    # saving a data frame to a buffer (same as with a regular file):
    df.to_csv(s_buf)
Edit: (I forgot) in order to read from the buffer afterwards, its position should be set back to the beginning:
s_buf.seek(0)
I'm not familiar with psycopg2, but according to the docs both copy_expert and copy_from can be used, for example:

    cur.copy_from(s_buf, table)
(For Python 2, see StringIO.)
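Putting those pieces together, an end-to-end version of this approach might look roughly like the sketch below. The connection string and table name are placeholders, and the CSV is written without header or index so that a plain copy_from can load it directly.

    import io
    import pandas as pd
    import psycopg2

    df = pd.DataFrame({"name": ["foo", "bar"], "id": [1, 2]})  # example frame

    conn = psycopg2.connect("dbname=mydb user=me")   # placeholder connection string
    cur = conn.cursor()

    s_buf = io.StringIO()
    df.to_csv(s_buf, index=False, header=False)      # plain data rows only
    s_buf.seek(0)                                    # rewind before reading

    cur.copy_from(s_buf, 'my_table', sep=',')        # placeholder table name
    conn.commit()
    cur.close()
    conn.close()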
I had problems implementing the solution from ptrj.
I think the issue stems from pandas setting the pos of the buffer to the end.
See as follows:
    from StringIO import StringIO

    df = pd.DataFrame({"name": ['foo', 'bar'], "id": [1, 2]})
    s_buf = StringIO()
    df.to_csv(s_buf)
    s_buf.__dict__

    # Output
    # {'softspace': 0, 'buflist': ['foo,1\n', 'bar,2\n'],
    #  'pos': 12, 'len': 12, 'closed': False, 'buf': ''}
Notice that pos is at 12. I had to set pos to 0 in order for the subsequent copy_from command to work:
    s_buf.pos = 0
    cur = conn.cursor()
    cur.copy_from(s_buf, tablename, sep=',')
    conn.commit()
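Note that the pos attribute belongs to the old Python 2 StringIO.StringIO class; with Python 3's io.StringIO you would rewind the buffer with seek(0) instead. A minimal sketch mirroring the snippet above:

    import io

    s_buf = io.StringIO()
    df.to_csv(s_buf)
    s_buf.seek(0)   # io.StringIO has no pos attribute; rewind with seek()

    cur = conn.cursor()
    cur.copy_from(s_buf, tablename, sep=',')
    conn.commit()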