I am working on a Python project that downloads and indexes files from the SEC EDGAR database. The problem is that, when using the requests module, it takes a very long time to save the text in a variable (between ~130 and 170 seconds for one file).
The file has roughly 16 million characters, and I wanted to see whether there is an easy way to lower the time it takes to retrieve the text.
Example:
import requests
url = "https://www.sec.gov/Archives/edgar/data/0001652044/000165204417000008/goog10-kq42016.htm"
r = requests.get(url, stream=True)
print(r.text)
Thanks!
What I found is in the code for r.text, specifically the case where no encoding was given (r.encoding is None). Detecting the encoding took 20 seconds; I was able to skip that step by defining the encoding myself.
...
r.encoding = 'utf-8'
...
In my case, the response did not declare an encoding. The response was 256k in size, and r.apparent_encoding was taking 20 seconds.
Looking into the text property: it checks whether an encoding is set. If it is None, it falls back to the apparent_encoding property, which scans the content to autodetect the encoding scheme.
On a long string this takes time. By defining the encoding of the response (as described above), you skip the detection.
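To make the fallback concrete, here is a minimal sketch of the decision described above. It is a simplified model, not the actual requests source: decode_body and slow_detector are hypothetical names, and the detector is a stand-in that only records that it was called.

```python
from typing import Callable, Optional


def decode_body(content: bytes,
                declared_encoding: Optional[str],
                detect: Callable[[bytes], str]) -> str:
    """Simplified model of the fallback (hypothetical, not requests' own code)."""
    encoding = declared_encoding
    if encoding is None:
        # Expensive path: the detector has to scan the whole payload.
        encoding = detect(content)
    return content.decode(encoding, errors="replace")


calls = []


def slow_detector(data: bytes) -> str:
    # Hypothetical stand-in for a charset detector; records each invocation.
    calls.append(len(data))
    return "utf-8"


body = "Annual report".encode("utf-8")

text_detected = decode_body(body, None, slow_detector)     # detector runs
text_declared = decode_body(body, "utf-8", slow_detector)  # detector is skipped

print("detector calls:", len(calls))
```

Both calls return the same text, but only the first one pays for detection, which is why setting r.encoding up front avoids the delay.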
In your example above:
from datetime import datetime
import requests
url = "https://www.sec.gov/Archives/edgar/data/0001652044/000165204417000008/goog10-kq42016.htm"
r = requests.get(url, stream=True)
print(r.encoding)
print(datetime.now())
enc = r.apparent_encoding
print(enc)
print(datetime.now())
print(r.text)
print(datetime.now())
r.encoding = enc
print(r.text)
print(datetime.now())
Of course, the timestamps may get lost among the printed text, so I recommend running the above in an interactive shell; it may become apparent where you are losing the time even without printing datetime.now().
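If the timestamps are hard to find in the output, one option is to collect the durations separately and print them at the end. This is only a sketch: timed and timings are hypothetical names, and time.sleep stands in for the real work. In the script above you would wrap r.apparent_encoding and r.text instead.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(label):
    # Record how long the enclosed block takes, keyed by label.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


# Placeholder workload; replace with e.g. `enc = r.apparent_encoding`.
with timed("placeholder step"):
    time.sleep(0.01)

# Print all measurements together, after any large output.
for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```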