I'm fetching and caching (for performance) lots of URLs with something like:

    import requests
    import requests_cache
    from multiprocessing.pool import ThreadPool

    urls = ['http://www.google.com', ...]

    with requests_cache.enabled():
        responses = ThreadPool(100).map(requests.get, urls)
However, I'm getting a lot of errors like:

    sqlite3.OperationalError: database is locked
Clearly too many threads are accessing the cache at the same time. So does requests_cache support some kind of transaction, so that the write only occurs once all the threads have finished? E.g.

    with requests_cache.enabled():
        with requests_cache.transaction():
            responses = ThreadPool(100).map(requests.get, urls)
Since requests_cache.enabled() and its related functions use monkey-patching, they're unfortunately not thread-safe.

Fortunately, the underlying class that does all the actual caching (CachedSession) is thread-safe as of requests-cache 0.6+ (and improved further in 0.7+), so that's probably what you want to use here. There's a full example using ThreadPoolExecutor here: https://github.com/reclosedev/requests-cache/blob/master/examples/threads.py

Like the other answer mentioned, Redis is going to be a better option for concurrent requests, but it's not strictly necessary. SQLite handles concurrency well enough: it supports unlimited concurrent reads, while concurrent writes are internally queued and run serially. In many cases this is still fast enough that you won't even notice, but if you're doing really large volumes of concurrent writes, then Redis or one of the other backends will be better optimized for that.
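As a minimal sketch of that approach: share one CachedSession across workers instead of monkey-patching requests.get (the cache name, worker count, and URL list here are placeholders, not anything from the question):

    from concurrent.futures import ThreadPoolExecutor

    from requests_cache import CachedSession

    urls = ['http://www.google.com'] * 10  # placeholder URL list

    # One shared session: CachedSession itself is thread-safe
    # (requests-cache 0.6+), so all workers can reuse it.
    session = CachedSession('demo_cache')

    with ThreadPoolExecutor(max_workers=10) as executor:
        responses = list(executor.map(session.get, urls))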
I have a Django REST Framework application. It works perfectly fine until requests come in simultaneously; when that happens, the app sometimes starts throwing database is locked errors. My first guess was that the Django database was overloaded and needed to be replaced with something beefier.

Reproducing the problem by running parallel requests with curl from bash (see here) gave me fresh logs and traces. I found that requests-cache runs into problems when cleaning out its database. It was configured to cache for 600 seconds, so the first batch run after the cache was populated would always fail:
    ...
    File "/opt/app/lib/python3.5/site-packages/requests_cache/core.py" in remove_expired_responses
      159. self.cache.remove_old_entries(datetime.utcnow() - self._cache_expire_after)
    File "/opt/app/lib/python3.5/site-packages/requests_cache/backends/base.py" in remove_old_entries
      117. self.delete(key)
    File "/opt/app/lib/python3.5/site-packages/requests_cache/backends/base.py" in delete
      83. del self.responses[key]
    File "/opt/app/lib/python3.5/site-packages/requests_cache/backends/storage/dbdict.py" in __delitem__
      130. self.table_name, (key,))

    Exception Type: OperationalError at /app/v1/invitations/
    Exception Value: database is locked
Looking into possible solutions, I found that Redis could be used as a backend. I installed Redis and ran it for localhost only. Simply changing the cache config's backend from 'sqlite' to 'redis' fixed the problem.
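For illustration, a minimal sketch of that one-line backend switch using install_cache (the cache name is a placeholder; the 600-second expiry matches the setup described above, and the Redis connection is assumed to be the localhost default):

    import requests_cache

    # Before: SQLite backend, which serializes concurrent writes
    # requests_cache.install_cache('api_cache', backend='sqlite', expire_after=600)

    # After: Redis backend, which handles concurrent access much better
    requests_cache.install_cache('api_cache', backend='redis', expire_after=600)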
I feel a bit like I'm fixing a loose bolt with a hammer, but I'm happy I got it working without breaking anything. I'm sure someone could find a better, more elegant solution, like passing an sqlite config parameter through requests-cache, or a code fix.