I am using multiple threads to access and delete data in my pandas dataframe. Because of this, I am wondering: is a pandas dataframe thread-safe?
No, pandas is not thread safe.
First, pandas is single threaded, meaning that it cannot leverage multiple cores in a machine or cluster. Second, pandas operates entirely in memory and does so inefficiently, which can lead to disruptive out-of-memory errors.
Python is not by itself thread safe. But there are moves to change this: NoGil, etc. Removing the GIL does not make functions thread-safe.
Thread safety is a concept that applies to multithreaded programs. Code is considered thread-safe if it works correctly when called from multiple threads at once. For example, the print function is not thread-safe.
No, pandas is not thread safe. And it's not thread safe in surprising ways.
Fuggedaboutit! Nope. And generally no. Not even for GIL-locked python datastructures.
Definitely not. There's a long standing open issue: https://github.com/pandas-dev/pandas/issues/2728
Actually I think this is pretty reasonable (i.e. expected) behavior. I wouldn't expect to be able to simultaneously write to and read from, or copy, any data structure unless either: i) it had been designed for concurrency, or ii) I have an exclusive lock on that object and all the view objects derived from it (.loc and .iloc can return views, and pandas has many others).
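Option ii) above can be sketched roughly as follows. This is only an illustration, not a pandas feature: `LockedFrame`, `read`, `drop_rows` and `demo` are hypothetical names, and the approach only works if every thread goes through the wrapper and never keeps a view of the frame after the lock is released.

```python
import threading

import pandas as pd


class LockedFrame:
    """Hypothetical wrapper: all access to the frame goes through one lock."""

    def __init__(self, df):
        self._df = df
        self._lock = threading.Lock()

    def read(self, func):
        # Run func(df) while holding the lock. func should return plain
        # values or copies, never views that outlive the lock.
        with self._lock:
            return func(self._df)

    def drop_rows(self, labels):
        # Swap in the new frame under the lock, so no reader runs mid-delete.
        with self._lock:
            self._df = self._df.drop(index=labels)


def demo():
    shared = LockedFrame(pd.DataFrame({"x": range(100)}))

    def worker(start):
        for label in range(start, 100, 4):
            if label % 2 == 0:
                shared.drop_rows([label])  # even labels are deleted
            else:
                shared.read(lambda df: int(df["x"].sum()))  # odd labels only read

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared.read(lambda df: df.index.tolist())
```

Calling `demo()` leaves only the odd row labels in the frame. The price is that all access is serialized, so the threads gain no parallelism on the dataframe itself.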
For almost all data structures in Python, the answer is yes. For pandas, no. And it seems it's not a design goal at present.
Typically, you can perform 'reading' operations on objects if no-one is performing mutating operations. You have to be a little cautious though. Some data structures, including pandas, perform memoization to cache expensive operations that are otherwise functionally pure. It's generally easy to implement lockless memoization in Python:
    @property
    def thing(self):
        # Compute once on first access, then reuse the cached value.
        if self._thing is MISSING:
            self._thing = self._calc_thing()
        return self._thing
... it's simple and safe (assuming assignment is safely atomic -- which has not always been the case for every language, but is in CPython, unless you override __setattr__).
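For reference, here is a self-contained version of that pattern that can be run from several threads. It is a sketch under the same assumption (atomic attribute assignment in CPython); `Report`, `total`, `calc_calls` and `read_concurrently` are made-up names for illustration.

```python
import threading

_MISSING = object()  # sentinel, so that None could still be a valid cached value


class Report:
    """Hypothetical class with one expensive, functionally pure property."""

    def __init__(self, values):
        self._values = tuple(values)
        self._total = _MISSING
        self.calc_calls = 0  # demo-only counter; not itself thread-safe

    def _calc_total(self):
        self.calc_calls += 1
        return sum(self._values)

    @property
    def total(self):
        if self._total is _MISSING:
            # Two threads may both reach this line and compute the value twice.
            # That is wasted work, but both compute the same result.
            self._total = self._calc_total()
        return self._total


def read_concurrently(report, n_threads=8):
    results = []

    def worker():
        results.append(report.total)  # list.append is atomic in CPython

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Every thread sees the same value. The only thing that can vary between runs is how many times `_calc_total` ran, which is why the pattern is only appropriate when the computation is pure.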
In pandas, series and dataframe indexes are computed lazily, on first use. I hope (but I do not see guarantees in the docs) that they're done in a similarly safe way.
For all libraries (including pandas) I would hope that all types of read-only operations (or more specifically, 'functionally pure' operations) would be thread safe if no-one is performing mutating operations. I think this is a 'reasonable', easily achievable, common lower bar for thread safety.
For pandas, however, you cannot assume this. Even if you can guarantee no-one is performing 'functionally impure' operations on your object (e.g. writing to cells, adding/deleting columns), pandas is not thread safe.
Here's a recent example: https://github.com/pandas-dev/pandas/issues/25870 (it's marked as a duplicate of the .copy-not-threadsafe issue, but it seems it could be a separate issue).
s = pd.Series(...)
f(s) # Success!
# Thread 1:
while True: f(s)
# Thread 2:
while True: f(s) # Exception !
... fails for f(s): s.reindex(..., copy=True), which returns its result as a new object -- you would think it would be functionally pure and thread safe. Unfortunately, it is not.
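A small harness along these lines can be used to try the pattern out, with or without a lock around each call. It is a sketch, not the code from the linked issue: `run_in_threads` and `reindex_reversed` are hypothetical helpers, the series and the new index are arbitrary, and the `copy` keyword is left out because its handling differs between pandas versions.

```python
import threading

import pandas as pd


def run_in_threads(func, series, n_threads=4, n_calls=200, lock=None):
    """Call func(series) repeatedly from several threads; return raised exceptions."""
    errors = []

    def worker():
        for _ in range(n_calls):
            try:
                if lock is None:
                    func(series)  # unsynchronized, as in the example above
                else:
                    with lock:  # serialize every call on the shared series
                        func(series)
            except Exception as exc:
                errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


def reindex_reversed(s):
    # Stand-in for f(s): returns a new Series and leaves s untouched.
    return s.reindex(s.index[::-1])
```

With `lock=threading.Lock()` the calls run one at a time, so no concurrent access to the series remains. Without a lock, whether an exception shows up depends on the pandas version, the data and the timing, so a clean run proves nothing.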
The result of this is that we could not use pandas in production for our healthcare analytics system - and I now discourage it for internal development since it makes in-memory parallelization of read-only operations unsafe. (!!)
The reindex behavior is weird and surprising. If anyone has ideas about why it fails, please answer here: What's the source of thread-unsafety in this usage of pandas.Series.reindex(, copy=True)?
The maintainers marked this as a duplicate of https://github.com/pandas-dev/pandas/issues/2728 . I'm suspicious, but if .copy is the source, then almost all of pandas is not thread safe in any situation (which is their advice).