
pandas and numpy thread safety

I'm using pandas on a web server (Apache + mod_wsgi + Django) and have a hard-to-reproduce bug which, I have now discovered, is caused by pandas not being thread-safe.

After a lot of code reduction I finally found a short standalone program which can be used to reproduce the problem. You can see it below.

The point is: contrary to the answer to this question, this example shows that pandas can crash even with very simple operations that do not modify a dataframe. I cannot see how this simple code snippet could possibly be unsafe with threads...

The question is about using pandas and numpy in a web server. Is it possible? How am I supposed to fix my code using pandas? (an example of lock usage would be helpful)

Here is the code which causes a Segmentation Fault:

import threading
import pandas as pd
import numpy as np

def let_crash(crash=True):
    t = 0.02 * np.arange(100000)  # OK with 10000
    data = pd.DataFrame({'t': t})
    if crash:
        data['t'] * 1.5  # CRASH
    else:
        data['t'].values * 1.5  # THIS IS OK!

if __name__ == '__main__':
    threads = []
    for i in range(100):
        if True:  # asynchronous
            t = threading.Thread(target=let_crash, args=())
            t.daemon = True
            t.start()
            threads.append(t)
        else:  # synchronous
            let_crash()
    for t in threads:
        t.join()

My environment: python 2.7.3, numpy 1.8.0, pandas 0.13.1

asked Sep 11 '14 08:09 by Emanuele Paolini


1 Answer

See the caveat in the docs here: http://pandas.pydata.org/pandas-docs/dev/gotchas.html#thread-safety

pandas is not thread-safe because its underlying copy mechanism is not. NumPy, I believe, has an atomic copy operation, but pandas adds a layer on top of it.

Copying is the basis of pandas operations, as most operations generate a new object to return to the user.

Fixing this is not trivial and would come with a fairly heavy performance cost, so handling it properly would take some work.

The easiest approach is simply not to share objects across threads, or to lock them while they are in use.
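
Since the question asks for an example of lock usage, here is a minimal sketch of the locking option. It assumes a single process-wide lock that every pandas operation goes through; the names PANDAS_LOCK, scale_column and run_threads are made up for illustration and are not part of pandas. It has not been checked against the versions in the question, so treat it as a starting point rather than a confirmed fix.

```python
import threading

# Hypothetical process-wide lock; the name is an assumption, not a pandas API.
PANDAS_LOCK = threading.Lock()


def scale_column(data, column, factor):
    """Return data[column] * factor, computed while holding PANDAS_LOCK."""
    with PANDAS_LOCK:  # only one thread at a time runs the pandas operation
        return data[column] * factor


def run_threads(worker, count):
    """Start `count` threads running `worker` and wait for all of them."""
    threads = [threading.Thread(target=worker) for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    import numpy as np
    import pandas as pd

    def worker():
        # Same workload as in the question, but the multiplication is locked.
        data = pd.DataFrame({"t": 0.02 * np.arange(100000)})
        scale_column(data, "t", 1.5)

    run_threads(worker, 100)
```

The trade-off is that the lock serializes all pandas work, so threads no longer run those operations concurrently. In a web server that is usually acceptable if the locked sections are short.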

answered Sep 23 '22 09:09 by Jeff