I am hitting a webservice with Python's requests
library and the endpoint is returning a (very large) CSV file which I then want to stream into a database. The code looks like this:
    response = requests.get(url, auth=auth, stream=True)
    if response.status_code == 200:
        stream_csv_into_database(response)
Now, when the database is a MongoDB database, the loading works perfectly using a csv.DictReader:
    def stream_csv_into_database(response):
        ...
        for record in csv.DictReader(response.iter_lines(), delimiter='\t'):
            product_count += 1
            product = {k: v for (k, v) in record.iteritems() if v}
            product['_id'] = product_count
            collection.insert(product)
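(As an aside, csv.DictReader expects an iterable of text lines; under Python 3, response.iter_lines() yields bytes unless you pass decode_unicode=True. A minimal, self-contained sketch of just the parsing step, using an in-memory list in place of the streamed response and hypothetical column names:)

    import csv

    # Stand-in for response.iter_lines(decode_unicode=True): an iterable of text lines.
    lines = ["sku\tname\tprice", "A1\tWidget\t9.99"]

    # Same parsing step as the MongoDB loader above, minus the database insert.
    records = [{k: v for k, v in row.items() if v}
               for row in csv.DictReader(lines, delimiter='\t')]
    # records[0] == {'sku': 'A1', 'name': 'Widget', 'price': '9.99'}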
However, I am switching from MongoDB to Amazon Redshift, which I can already access just fine using psycopg2. I can open connections and make simple queries, but what I want to do is take my streamed response from the webservice and use psycopg2's copy_expert to load the Redshift table. Here is what I have tried so far:
    def stream_csv_into_database(response, campaign, config):
        print 'Loading product feed for {0}'.format(campaign)
        conn = new_redshift_connection(config)  # My own helper, works fine.
        table = 'products.' + campaign
        cur = conn.cursor()
        reader = response.iter_lines()
        # Error on the following line:
        cur.copy_expert("COPY {0} FROM STDIN WITH CSV HEADER DELIMITER '\t'".format(table), reader)
        conn.commit()
        cur.close()
        conn.close()
The error that I get is:

    file must be a readable file-like object for COPY FROM; a writable file-like object for COPY TO.
I understand what the error is saying; in fact, I can see from the psycopg2 documentation that copy_expert calls copy_from, which:

    Reads data from a file-like object appending them to a database table (COPY table FROM file syntax). The source file must have both read() and readline() method.
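(That requirement is easy to check for yourself: response.iter_lines() returns a generator, which has neither method, whereas any io stream does. A quick illustration with stand-in data:)

    import io

    lines = (line for line in [b"a\tb\n", b"1\t2\n"])  # roughly what iter_lines() returns
    print(hasattr(lines, 'read'))      # False - generators are not file-like

    buf = io.BytesIO(b"a\tb\n1\t2\n")  # a true file-like object
    print(hasattr(buf, 'read'), hasattr(buf, 'readline'))  # True True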
My problem is that I cannot find a way to make the response object be a file-like object! I tried both .data and .iter_lines without success. I certainly do not want to download the entire multi-gigabyte file from the webservice and then upload it to Redshift. There must be a way to use the streaming response as a file-like object that psycopg2 can copy into Redshift. Anyone know what I am missing?
You could use the response.raw file object, but take into account that any content encoding (such as gzip or deflate compression) will still be in place unless you set the decode_content flag to True when calling .read(), which psycopg2 will not do. You can set the flag on the raw file object to change the default to decompressing-while-reading:

    response.raw.decode_content = True

and then pass the response.raw file object to copy_expert() (the same object would also work for csv.DictReader()).
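(If for some reason response.raw cannot be used, for example because you need to transform lines on the fly, another option is a small adapter that gives any iterable of byte chunks the read()/readline() methods psycopg2 asks for. A sketch under that assumption, with a hypothetical IterStream name; note that iter_lines() strips the line terminators, so they must be re-appended:)

    import io

    class IterStream(io.RawIOBase):
        """Wrap an iterable of bytes chunks as a readable file-like object."""
        def __init__(self, iterable):
            self._iter = iter(iterable)
            self._leftover = b''

        def readable(self):
            return True

        def readinto(self, b):
            # Refill the internal buffer from the iterator as needed.
            while not self._leftover:
                try:
                    self._leftover = next(self._iter)
                except StopIteration:
                    return 0  # EOF
            n = min(len(b), len(self._leftover))
            b[:n] = self._leftover[:n]
            self._leftover = self._leftover[n:]
            return n

    # Wrapping in BufferedReader provides readline() on top of readinto().
    stream = io.BufferedReader(IterStream([b"a\tb\n", b"1\t2\n"]))
    # stream.readline() -> b'a\tb\n'; stream.read() -> b'1\t2\n'

(In the real loader you would build the file argument for copy_expert as io.BufferedReader(IterStream(line + b'\n' for line in response.iter_lines())).)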