So I'm running the following code using the psycopg2 driver in Python 3.5 with pandas 0.19.x.
import io
import pandas as pd

# conn is an existing psycopg2 connection; base_sql is the query to export
buf = io.StringIO()
cursor = conn.cursor()
sql_query = 'COPY (' + base_sql + ' limit 100) TO STDOUT WITH CSV HEADER'
cursor.copy_expert(sql_query, buf)
df = pd.read_csv(buf.getvalue(), engine='c')  # <-- this line raises the error below
buf.close()
The read_csv blows chunks when reading from the memory buffer:
pandas\parser.pyx in pandas.parser.TextReader.__cinit__ (pandas\parser.c:4175)()
pandas\parser.pyx in pandas.parser.TextReader._setup_parser_source (pandas\parser.c:8333)()
C:\Users\....\AppData\Local\Continuum\Anaconda3\lib\genericpath.py in exists(path)
17 """Test whether a path exists. Returns False for broken symbolic links"""
18 try:
---> 19 os.stat(path)
20 except OSError:
21 return False
ValueError: stat: path too long for Windows
Uh... what path? buf is in memory. What am I missing here?
Just FYI, the COPY TO seems to be working as expected.
SOLUTION CODE BELOW
Thanks to the answer below, my query speed doubled using this method and my memory use dropped roughly fivefold. Here is my final test code, included to help others resolve their performance issues. I'd love to see any code that improves on this! Be sure to link back to this question if you post yours.
# COPY TO CSV quick and dirty performance test
import io
import time
import psycopg2
import pandas as pd
from urllib.parse import urlparse

# user_id, pswd, and my_sql_query are defined elsewhere
start = time.time()
conn_str_copy = r'postgresql+psycopg2://' + user_id + r":" + pswd + r"@xxx.xxx.xxx.xxx:ppppp/my_database"
result = urlparse(conn_str_copy)
username = result.username
password = result.password
database = result.path[1:]
hostname = result.hostname
size = 2**30  # read buffer size passed to copy_expert
buf = io.BytesIO()
# buf = io.StringIO()
# note: no SQLAlchemy engine is needed here; psycopg2 is used directly
conn_copy = psycopg2.connect(
    database=database, user=username, password=password, host=hostname)
cursor_copy = conn_copy.cursor()
sql_query = 'COPY (' + my_sql_query + ') TO STDOUT WITH CSV HEADER'
cursor_copy.copy_expert(sql_query, buf, size)
print('time:', (time.time() - start)/60, 'minutes or', time.time() - start, 'seconds')
buf.seek(0)  # rewind: copy_expert left the position at the end of the buffer
df = pd.read_csv(buf, engine='c', low_memory=False)
buf.close()
print('time:', (time.time() - start)/60, 'minutes or', time.time() - start, 'seconds')
Speed is ~4 minutes to copy the data from Postgres and less than 30 seconds to load it into the pandas DataFrame. Note that copy_expert is a feature of the psycopg2 driver and may not work with other drivers.
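For drivers without copy_expert, a plain pandas.read_sql through SQLAlchemy is the portable (if slower) fallback. A minimal sketch, assuming the same conn_str_copy and my_sql_query from the test code above:

# Portable fallback: no COPY, just a regular query via SQLAlchemy.
# Slower than copy_expert, but works with any supported DB-API driver.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(conn_str_copy)  # same connection string as above
df = pd.read_sql(my_sql_query, engine)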
You have to pass a file handle or a filename to pandas.read_csv(). Passing buf.getvalue() makes read_csv believe you're passing a filename, since the object (a plain str) doesn't have a read method; except that the "filename" is the entire buffer contents, and it's seen as too long (Windows limits filenames to 255 characters).
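A quick way to see the difference, as a minimal sketch with made-up CSV data:

import io
import pandas as pd

buf = io.StringIO("a,b\n1,2\n3,4\n")
df_ok = pd.read_csv(buf)        # file-like object: read_csv calls buf.read()
text = buf.getvalue()           # plain str: read_csv treats it as a *path*
# pd.read_csv(text)             # -> tries os.stat(text) and fails
df_ok2 = pd.read_csv(io.StringIO(text))  # wrapping the str in StringIO works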
You almost got it. Since buf is already a file-like object, just pass it as-is. Small detail: you have to rewind it, because the previous cursor.copy_expert(sql_query, buf) call used write and buf's position is at the end (try without it, and you'll probably get an empty dataframe):
buf.seek(0)  # rewind because you're at the end of the buffer
df = pd.read_csv(buf, engine='c')
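If you want to convince yourself what seek(0) is doing, here's a tiny sketch:

import io

buf = io.StringIO()
buf.write("a,b\n1,2\n")
print(buf.tell())   # 8 -> writing leaves the position at the end
print(buf.read())   # '' -> reading from the end yields nothing (empty dataframe)
buf.seek(0)         # rewind to the start
print(buf.read())   # 'a,b\n1,2\n' -> now read_csv would see the data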