python pandas dataframe thread safe?

I am using multiple threads to access and delete data in my pandas DataFrame. Because of this, I am wondering: is a pandas DataFrame thread safe?

Andrew asked Nov 27 '12 20:11



1 Answer

No, pandas is not thread safe. And it's not thread safe in surprising ways.

  • Can I delete from a pandas DataFrame while another thread is using it?

Fuggedaboutit! Nope. And generally no. Not even for GIL-locked Python data structures.

  • Can I read from a pandas object while someone else is writing to it?
  • Can I copy a pandas dataframe in my thread, and work on the copy?

Definitely not. There's a long-standing open issue: https://github.com/pandas-dev/pandas/issues/2728

Actually, I think this is pretty reasonable (i.e. expected) behavior. I wouldn't expect to be able to simultaneously write to and read from, or copy, any data structure unless either: i) it had been designed for concurrency, or ii) I have an exclusive lock on that object and all the view objects derived from it (.loc and .iloc can return views, and pandas has many others).
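
Option ii) can be sketched roughly as follows. This is a minimal illustration, not a pandas feature: Guarded and borrow are hypothetical names, and a plain dict stands in for the DataFrame so the sketch runs with the standard library alone. It only helps if every thread goes through the wrapper and nothing keeps a reference (or a view) after the lock is released.

```python
import threading
from contextlib import contextmanager


class Guarded:
    """Hypothetical wrapper: exposes the wrapped object only while a lock is held."""

    def __init__(self, obj):
        self._obj = obj
        self._lock = threading.RLock()

    @contextmanager
    def borrow(self):
        # Every reader and writer must come through here. A thread that keeps
        # a reference (or a derived view) after the block exits defeats the lock.
        with self._lock:
            yield self._obj


shared = Guarded({"rows": list(range(1000))})  # stand-in for a DataFrame


def delete_some():
    # Writer: removes one row per iteration, always under the lock.
    for _ in range(100):
        with shared.borrow() as data:
            if data["rows"]:
                data["rows"].pop()


def snapshot(out):
    # Reader: copies while holding the lock, then measures the copy.
    for _ in range(100):
        with shared.borrow() as data:
            copy_of_rows = list(data["rows"])
        out.append(len(copy_of_rows))


sizes = []
threads = [threading.Thread(target=delete_some) for _ in range(2)]
threads.append(threading.Thread(target=snapshot, args=(sizes,)))
for t in threads:
    t.start()
for t in threads:
    t.join()

with shared.borrow() as data:
    remaining = len(data["rows"])
```

The cost is that all access is serialized, so this buys safety, not parallelism.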

  • Can I read from a pandas object while no-one else is writing to it?

For almost all data structures in Python, the answer is yes. For pandas, no. And it seems it's not a design goal at present.

Typically, you can perform 'reading' operations on objects if no-one is performing mutating operations. You have to be a little cautious though. Some data structures, including pandas, perform memoization to cache expensive operations that are otherwise functionally pure. It's generally easy to implement lockless memoization in Python:

@property
def thing(self):
    # Compute on first access, then reuse the cached value.
    if self._thing is MISSING:
        self._thing = self._calc_thing()
    return self._thing

... it's simple and safe (assuming assignment is safely atomic, which has not always been the case for every language, but is in CPython unless you override __setattr__).
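
A self-contained version of that pattern, as a sketch only: the Table class, its total property, and the calc_calls counter are invented for illustration and say nothing about how pandas implements its own caches. Several threads may see MISSING at the same time and each compute the value, but they all compute the same result, so whichever assignment lands last is harmless.

```python
import threading

MISSING = object()  # sentinel meaning "not computed yet"


class Table:
    """Hypothetical immutable container with one lazily cached property."""

    def __init__(self, values):
        self._values = tuple(values)
        self._total = MISSING
        self.calc_calls = 0  # demo bookkeeping only; this counter is itself racy

    def _calc_total(self):
        self.calc_calls += 1
        return sum(self._values)

    @property
    def total(self):
        # No lock: the worst case is redundant work, never a wrong answer,
        # because _calc_total is pure and _total is never reset to MISSING.
        if self._total is MISSING:
            self._total = self._calc_total()
        return self._total


table = Table(range(100))
results = []


def read_total():
    results.append(table.total)  # list.append is atomic in CPython


threads = [threading.Thread(target=read_total) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every reader gets the same total; the only thing the missing lock can cost is that calc_calls ends up greater than one.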

Pandas Series and DataFrame indexes are computed lazily, on first use. I hope (but I do not see guarantees in the docs) that they're done in a similarly safe way.

For all libraries (including pandas) I would hope that all types of read-only operations (or more specifically, 'functionally pure' operations) would be thread safe if no-one is performing mutating operations. I think this is a reasonable, easily achievable, common lower bar for thread safety.

For pandas, however, you cannot assume this. Even if you can guarantee no-one is performing 'functionally impure' operations on your object (e.g. writing to cells, adding/deleting columns), pandas is not thread safe.

Here's a recent example: https://github.com/pandas-dev/pandas/issues/25870 (it's marked as a duplicate of the .copy-not-threadsafe issue, but it seems it could be a separate issue).

s = pd.Series(...)
f(s)  # Success!

# Thread 1:
   while True: f(s)  

# Thread 2:
   while True: f(s)  # Exception !

... fails for f(s): s.reindex(..., copy=True), which returns its result as a new object. You would think it would be functionally pure and thread safe. Unfortunately, it is not.
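
The two-thread loop above can be turned into a small harness. This is a rough sketch using only the standard library: hammer is a hypothetical helper name, and the sorted() call is a stand-in for whatever read-only operation is under test, not the reindex call from the issue.

```python
import threading


def hammer(f, n_threads=2, iterations=1000):
    """Call f() repeatedly from several threads; return any exceptions raised."""
    errors = []

    def worker():
        for _ in range(iterations):
            try:
                f()
            except Exception as exc:
                errors.append(exc)  # record the failure and stop this thread
                return

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


data = tuple(range(100))  # immutable stand-in for the shared object
errors = hammer(lambda: sorted(data, reverse=True))
```

To probe the case described here, f would wrap the s.reindex(..., copy=True) call with the arguments from the linked issue. An empty list only means no failure was observed in that run; it does not show the operation is thread safe.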

The result of this is that we could not use pandas in production for our healthcare analytics system - and I now discourage it for internal development since it makes in-memory parallelization of read-only operations unsafe. (!!)

The reindex behavior is weird and surprising. If anyone has ideas about why it fails, please answer here: What's the source of thread-unsafety in this usage of pandas.Series.reindex(, copy=True)?

The maintainers marked this as a duplicate of https://github.com/pandas-dev/pandas/issues/2728 . I'm suspicious, but if .copy is the source, then almost all of pandas is not thread safe in any situation (which is their advice).

user48956 answered Sep 27 '22 20:09