
psycopg2 Postgres COPY EXPERT to Pandas read_csv using memory buffer fails with ValueError

Tags: python, pandas

So I'm running the following code using the psycopg2 driver in Python 3.5 with pandas 0.19.x.

import io
import pandas as pd
import psycopg2

buf = io.StringIO()
cursor = conn.cursor()
sql_query = 'COPY (' + base_sql + ' limit 100) TO STDOUT WITH CSV HEADER'
cursor.copy_expert(sql_query, buf)            # writes the CSV rows into buf
df = pd.read_csv(buf.getvalue(), engine='c')  # <-- this line raises
buf.close()

The read_csv call blows up when reading from the memory buffer:

pandas\parser.pyx in pandas.parser.TextReader.__cinit__ (pandas\parser.c:4175)()

pandas\parser.pyx in pandas.parser.TextReader._setup_parser_source (pandas\parser.c:8333)()

C:\Users\....\AppData\Local\Continuum\Anaconda3\lib\genericpath.py in exists(path)
     17     """Test whether a path exists.  Returns False for broken symbolic links"""
     18     try:
---> 19         os.stat(path)
     20     except OSError:
     21         return False

ValueError: stat: path too long for Windows

Uh... what path? buf is in memory. What am I missing here?

Just FYI, the COPY TO itself seems to be working as expected.
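(For anyone debugging the same thing, a quick sanity check on the buffer contents; a sketch using the buf from the snippet above:)

print(len(buf.getvalue()))    # non-zero means COPY produced output
print(buf.getvalue()[:200])   # the header line plus the first rows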

SOLUTION CODE BELOW

Thanks to the answer below, my query speed doubled using this method and my memory use dropped roughly fivefold. Here is my final test code, included to help others resolve their performance issues. I'd love to see any code that improves on this! Be sure to link back to this question in your post.

# COPY TO CSV quick and dirty performance test
import io
import time
from urllib.parse import urlparse

import pandas as pd
import psycopg2
from sqlalchemy import create_engine

start = time.time()
conn_str_copy = r'postgresql+psycopg2://' + user_id + r":" + pswd + r"@xxx.xxx.xxx.xxx:ppppp/my_database"
result = urlparse(conn_str_copy)
username = result.username
password = result.password
database = result.path[1:]
hostname = result.hostname

size = 2**30                 # read buffer size passed to copy_expert
buf = io.BytesIO()           # bytes buffer; read_csv accepts it as-is
# buf = io.StringIO()

engine = create_engine(conn_str_copy)   # kept from the original test; unused below
conn_copy = psycopg2.connect(
    database=database, user=username, password=password, host=hostname)

cursor_copy = conn_copy.cursor()
sql_query = 'COPY (' + my_sql_query + ') TO STDOUT WITH CSV HEADER'
cursor_copy.copy_expert(sql_query, buf, size)
print('time:', (time.time() - start)/60, 'minutes or', time.time() - start, 'seconds')

buf.seek(0)                  # rewind before handing the buffer to pandas
df = pd.read_csv(buf, engine='c', low_memory=False)
buf.close()
print('time:', (time.time() - start)/60, 'minutes or', time.time() - start, 'seconds')

Speed is ~4 minutes to copy the data from Postgres and less than 30 seconds to load it into the pandas dataframe. Note that copy_expert is a feature of the psycopg2 driver and may not be available in other drivers.
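If your application hands out SQLAlchemy engines rather than raw psycopg2 connections, you can still reach copy_expert by borrowing the underlying DBAPI connection from the pool. A minimal sketch, assuming the conn_str_copy and my_sql_query names from the test code above:

import io

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(conn_str_copy)
raw_conn = engine.raw_connection()   # psycopg2 connection checked out of the pool
try:
    buf = io.BytesIO()
    with raw_conn.cursor() as cur:
        cur.copy_expert('COPY (' + my_sql_query + ') TO STDOUT WITH CSV HEADER', buf)
    buf.seek(0)
    df = pd.read_csv(buf, engine='c', low_memory=False)
finally:
    raw_conn.close()                 # returns the connection to the pool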

asked Oct 18 '22 by Harvey

1 Answer

You have to pass a file handle or a filename to pandas.read_csv().

Passing buf.getvalue() makes pandas' read_csv believe you're passing a filename, since a plain string doesn't have a read method; but here the "filename" is the entire buffer contents, and it's seen as too long (Windows limits filenames to 255 characters).
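(If you really did want to work from the string, you could wrap it back into a buffer, though it's simpler to pass buf itself, as shown below. A sketch:)

import io
df = pd.read_csv(io.StringIO(buf.getvalue()), engine='c')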

You almost got it. Since buf is already a file-like object, just pass it as-is. One small detail: you have to rewind it first, because the previous cursor.copy_expert(sql_query, buf) call used write and left the buffer position at the end (try without the rewind and you'll probably get an empty dataframe):

buf.seek(0)  # rewind because you're at the end of the buffer
df = pd.read_csv(buf, engine='c')
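Here is a minimal, database-free demonstration of why the rewind matters (hypothetical data, just to illustrate the buffer position):

import io
import pandas as pd

buf = io.StringIO()
buf.write('a,b\n1,2\n3,4\n')    # stands in for what copy_expert wrote
# the buffer position is now at the end, exactly as after copy_expert

buf.seek(0)                     # without this, read_csv finds no data
df = pd.read_csv(buf, engine='c')
print(df.shape)                 # (2, 2)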
answered Oct 20 '22 by Jean-François Fabre