Can someone tell me how to add data to a pandas DataFrame in Python when multiple threads call a function that appends rows to that DataFrame?
My code scrapes data from a URL, and I was using df.loc[index]... to add each scraped row to the DataFrame.
I have since made the scraper multithreaded, with each URL assigned to its own thread, so in short many pages are being scraped at once.
How do I append those rows to the DataFrame?
However, most machine learning and scientific libraries used by data scientists (NumPy, pandas, scikit-learn, and so on) release the GIL in many of their compute-heavy routines, effectively allowing multithreaded execution on separate workers for those parts.
pandas is not 100% thread safe. The known issues relate to the copy() method. If you are doing a lot of copying of DataFrame objects shared among threads, we recommend holding locks inside the threads where the data copying occurs.
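One way to follow that recommendation is sketched below. It is a minimal example, not an official pandas recipe: the names `shared_df`, `df_lock`, `add_row`, and `snapshot` are made up for illustration, and the URLs are placeholders. A single `threading.Lock` guards both the row writes and the copy.

```python
import threading

import pandas as pd

# Hypothetical shared frame and lock; the names are illustrative only.
shared_df = pd.DataFrame(columns=["url", "length"])
df_lock = threading.Lock()


def add_row(url, length):
    """Append one row while holding the lock so writes never interleave."""
    with df_lock:
        shared_df.loc[len(shared_df)] = [url, length]


def snapshot():
    """Copy the shared frame under the same lock that guards the writes."""
    with df_lock:
        return shared_df.copy()


# Eight threads each add one row to the same DataFrame.
threads = [
    threading.Thread(target=add_row, args=(f"http://example.com/{i}", i))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

result = snapshot()
print(len(result))
```

Because every write waits for the lock, the threads gain nothing during the append itself; the benefit comes only from the work done outside the lock, such as the network request.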
pandas itself is single-threaded, meaning that a pandas operation cannot leverage multiple cores in a machine or cluster.
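The CPython interpreter does not support true multi-core execution via multithreading, because the global interpreter lock (GIL) lets only one thread run Python bytecode at a time. However, Python does have a threading library, and the GIL does not prevent threading: it is released while a thread waits on I/O, so threads still help with I/O-bound work such as downloading many pages.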
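To illustrate why threads still help with I/O-bound work, here is a small sketch that uses `time.sleep` as a stand-in for a network wait. `fake_download` and the delay values are hypothetical, chosen only for the demonstration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(delay):
    """Stand-in for a network call; time.sleep releases the GIL while waiting."""
    time.sleep(delay)
    return delay


delays = [0.2, 0.2, 0.2, 0.2]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as executor:
    finished = list(executor.map(fake_download, delays))
elapsed = time.perf_counter() - start

# The four waits overlap, so the total is typically far below the 0.8 s
# that running them one after another would take.
print(f"{len(finished)} downloads in {elapsed:.2f}s")
```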
Adding rows to a DataFrame one by one is not recommended, since each enlargement can copy the existing data. I suggest you have each thread return its row, collect the rows in a list, and call the DataFrame constructor only once at the end on the full data set.
Example:
# help from http://stackoverflow.com/a/28463266/3393459
# and http://stackoverflow.com/a/2846697/3393459
from multiprocessing.dummy import Pool as ThreadPool  # thread-based Pool

import pandas as pd
import requests


# called by each thread; returns one row as a dict instead of
# writing to a shared DataFrame
def get_web_data(url):
    return {'col1': 'something', 'request_data': requests.get(url).text}


urls = ["http://google.com", "http://yahoo.com"]

# map() returns the results in the same order as urls
with ThreadPool(4) as pool:
    results = pool.map(get_web_data, urls)

print(results)
# build the DataFrame once, from the full list of rows
print(pd.DataFrame(results))
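The same collect-then-build pattern can also be written with `concurrent.futures`, which makes it easier to keep going when one page fails. The sketch below is one possible variant, not part of the original answer: `scrape_all`, `fake_fetch`, the column names, and the `.example` URLs are all hypothetical, and `fake_fetch` replaces the real HTTP call so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import pandas as pd


def scrape_all(urls, fetch, max_workers=4):
    """Run fetch(url) in worker threads and build one DataFrame at the end.

    fetch is a hypothetical callable that returns the page text for a URL.
    """
    rows = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                rows.append({"url": url, "text": future.result(), "error": None})
            except Exception as exc:  # record the failure and keep going
                rows.append({"url": url, "text": None, "error": str(exc)})
    # rows is only modified by the main thread, so no lock is needed
    return pd.DataFrame(rows, columns=["url", "text", "error"])


def fake_fetch(url):
    """Offline stand-in for a real HTTP request, used only for this demo."""
    if "bad" in url:
        raise ValueError("simulated failure")
    return f"<html>{url}</html>"


df = scrape_all(["http://a.example", "http://bad.example"], fake_fetch)
print(df)
```

For real scraping, `fake_fetch` could be swapped for something like `lambda url: requests.get(url).text`. Note that `as_completed` yields results in completion order, so the rows may not match the order of `urls`.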