I'm fetching and caching (for performance) lots of URLs with something like:

    import requests
    import requests_cache
    from multiprocessing.pool import ThreadPool

    urls = ['http://www.google.com', ...]

    with requests_cache.enabled():
        responses = ThreadPool(100).map(requests.get, urls)
However, I'm getting a lot of errors like:

    sqlite3.OperationalError: database is locked
Clearly too many threads are accessing the cache at the same time. So does requests_cache support some kind of transaction, so that the write only occurs once all the threads have finished? E.g.

    with requests_cache.enabled():
        with requests_cache.transaction():
            responses = ThreadPool(100).map(requests.get, urls)
Since requests_cache.enabled() and its related functions use monkey-patching, they're unfortunately not thread-safe.

Fortunately, the underlying class that does all the actual caching (CachedSession) is thread-safe as of requests-cache 0.6+ (and improved further in 0.7+), so that's probably what you want to use here. There's a full example using ThreadPoolExecutor here: https://github.com/reclosedev/requests-cache/blob/master/examples/threads.py

Like the other answer mentioned, Redis is going to be a better option for concurrent requests, but it's not strictly necessary. SQLite handles concurrency well enough: it supports unlimited concurrent reads, while concurrent writes are internally queued and run serially. In many cases this is still fast enough that you won't even notice, but if you're doing really large volumes of concurrent writes, then Redis or one of the other backends will be better optimized for that.
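As a minimal sketch of that approach: share one CachedSession across workers instead of monkey-patching requests.get (the cache name, worker count, and URL list here are placeholders, not anything from the question):

    from concurrent.futures import ThreadPoolExecutor

    from requests_cache import CachedSession

    urls = ['http://www.google.com'] * 10  # placeholder URL list

    # One shared session: CachedSession itself is thread-safe
    # (requests-cache 0.6+), so all workers can reuse it.
    session = CachedSession('demo_cache')

    with ThreadPoolExecutor(max_workers=10) as executor:
        responses = list(executor.map(session.get, urls))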
I have a Django REST Framework application. It works perfectly fine until requests come in simultaneously; when that happens, the app sometimes starts throwing database is locked errors. My first guess was that the Django database was overloaded and needed to be replaced with something beefier.

Reproducing the problem by running parallel requests with curl from bash (see here) gave me fresh logs and traces. I found that requests-cache runs into problems when cleaning out its database. It was configured to cache for 600 seconds, so the first batch run after the cache was populated would always fail:
    ...
    File "/opt/app/lib/python3.5/site-packages/requests_cache/core.py" in remove_expired_responses
      159. self.cache.remove_old_entries(datetime.utcnow() - self._cache_expire_after)
    File "/opt/app/lib/python3.5/site-packages/requests_cache/backends/base.py" in remove_old_entries
      117. self.delete(key)
    File "/opt/app/lib/python3.5/site-packages/requests_cache/backends/base.py" in delete
      83. del self.responses[key]
    File "/opt/app/lib/python3.5/site-packages/requests_cache/backends/storage/dbdict.py" in __delitem__
      130. self.table_name, (key,))

    Exception Type: OperationalError at /app/v1/invitations/
    Exception Value: database is locked
Looking into possible solutions, I found that Redis could be used as a backend. I installed Redis and ran it for localhost only. Simply changing the cache config's backend from 'sqlite' to 'redis' fixed the problem.
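For illustration, a minimal sketch of that one-line backend switch using install_cache (the cache name is a placeholder; the 600-second expiry matches the setup described above, and the Redis connection is assumed to be the localhost default):

    import requests_cache

    # Before: SQLite backend, which serializes concurrent writes
    # requests_cache.install_cache('api_cache', backend='sqlite', expire_after=600)

    # After: Redis backend, which handles concurrent access much better
    requests_cache.install_cache('api_cache', backend='redis', expire_after=600)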
I feel a bit like I'm fixing a loose bolt with a hammer, but I'm happy I got it working without breaking anything. I'm sure someone could find a better, more elegant solution, like passing an sqlite config parameter through requests-cache, or a code fix.