I'm using pandas on a web server (Apache + mod_wsgi + Django) and have a hard-to-reproduce bug which, as I have now discovered, is caused by pandas not being thread-safe.
After a lot of code reduction I finally found a short standalone program which can be used to reproduce the problem. You can see it below.
The point is: contrary to the answer to this question, this example shows that pandas can crash even with very simple operations that do not modify a DataFrame. I can't imagine how this simple code snippet could possibly be unsafe with threads...
The question is about using pandas and numpy in a web server. Is it possible? How am I supposed to fix my code using pandas? (an example of lock usage would be helpful)
Here is the code which causes a Segmentation Fault:
import threading
import pandas as pd
import numpy as np


def let_crash(crash=True):
    t = 0.02 * np.arange(100000)  # ok with 10000
    data = pd.DataFrame({'t': t})
    if crash:
        data['t'] * 1.5  # CRASH
    else:
        data['t'].values * 1.5  # THIS IS OK!


if __name__ == '__main__':
    threads = []
    for i in range(100):
        if True:  # asynchronous
            t = threading.Thread(target=let_crash, args=())
            t.daemon = True
            t.start()
            threads.append(t)
        else:  # synchronous
            let_crash()
    for t in threads:
        t.join()
My environment: python 2.7.3, numpy 1.8.0, pandas 0.13.1
See the caveat in the docs here: http://pandas.pydata.org/pandas-docs/dev/gotchas.html#thread-safety
pandas is not thread-safe because its underlying copy mechanism is not. NumPy, I believe, has an atomic copy operation, but pandas adds a layer above it.
Copying is the basis of pandas operations, since most operations generate a new object to return to the user.
Fixing this is not trivial and would come with a fairly heavy performance cost, so it would take some work to handle properly.
The easiest approach is simply not to share objects across threads, or to lock them while they are in use.
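Since the question asks for an example of lock usage, here is a minimal sketch of the "lock them on usage" idea. The names pandas_lock, results and scale_column are made up for illustration, and a single process-wide lock is an assumption of this sketch rather than anything pandas prescribes. I have not verified that it avoids the segfault on pandas 0.13.1; it only shows the pattern of serializing every pandas call behind one lock.

```python
import threading

import numpy as np
import pandas as pd

# Hypothetical process-wide lock: every pandas operation runs while holding it,
# so at most one thread is inside pandas at any time.
pandas_lock = threading.Lock()

# Collected by the worker threads (list.append is atomic in CPython).
results = []


def scale_column(factor):
    # Plain NumPy work stays outside the lock.
    t = 0.02 * np.arange(100000)
    with pandas_lock:
        # DataFrame construction and Series arithmetic are both serialized.
        data = pd.DataFrame({'t': t})
        scaled = data['t'] * factor
        total = float(scaled.sum())
    results.append(total)


threads = [threading.Thread(target=scale_column, args=(1.5,)) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

The obvious trade-off is that the pandas parts of each request no longer run concurrently. In a web server it may be simpler to keep every DataFrame local to the request that created it, so that nothing is shared across threads in the first place.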