
pandas dataframe in multiple threads

Can someone tell me how to add data to a pandas DataFrame in Python when multiple threads are calling a function that appends rows to that DataFrame?

My code scrapes data from a URL, and I was using df.loc[index]... to add each scraped row to the DataFrame.

I've now switched to multiple threads, with each URL assigned to its own thread, so many pages are being scraped at once...

How do I append those rows to the DataFrame?

asked Dec 02 '16 by Yasir Azeem

People also ask

Does pandas support multithreading?

Most machine learning and scientific libraries used by data scientists (NumPy, pandas, scikit-learn, and so on) release the GIL, effectively allowing multithreaded execution on separate workers.
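As a minimal sketch of what that means in practice (purely illustrative, not from the original answer): large NumPy operations release the GIL while the C-level work runs, so worker threads can genuinely overlap.

from concurrent.futures import ThreadPoolExecutor
import numpy as np

# illustrative data: four largish matrices to multiply
matrices = [np.random.rand(1000, 1000) for _ in range(4)]

def multiply(m):
    # heavy BLAS-backed work; NumPy releases the GIL while it runs
    return m @ m

with ThreadPoolExecutor(max_workers=4) as executor:
    products = list(executor.map(multiply, matrices))

print(len(products))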

Are pandas Dataframes thread safe?

pandas is not 100% thread safe. The known issues relate to the copy() method. If you are doing a lot of copying of DataFrame objects shared among threads, we recommend holding locks inside the threads where the data copying occurs.
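A minimal sketch of that advice, assuming a DataFrame shared between threads (the names df, copy_lock and worker are illustrative):

import threading
import pandas as pd

df = pd.DataFrame({'a': range(1000)})  # shared between threads (illustrative)
copy_lock = threading.Lock()

def worker():
    # hold a lock around the copy, as suggested for DataFrames shared among threads
    with copy_lock:
        local_df = df.copy()
    # ... work on local_df without touching the shared object ...

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()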

Is pandas single threaded?

pandas itself is single-threaded, meaning that it cannot leverage multiple cores in a machine or cluster.

Does Python allow multi threading?

Python does have a threading library, but on the CPython interpreter the GIL prevents threads from executing Python bytecode on multiple cores at once. Threading still works; it just does not give true multi-core parallelism for CPU-bound Python code, although it remains useful for I/O-bound work.
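For reference, a minimal use of the threading module for I/O-bound work such as HTTP requests (the URLs here are just placeholders):

import threading
import requests

def fetch(url):
    # I/O-bound work: the GIL is released while waiting on the network
    print(url, len(requests.get(url).text))

urls = ["http://google.com", "http://yahoo.com"]
threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()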


1 Answer

Adding rows to a DataFrame one by one is not recommended. I suggest you collect your data in lists, combine those lists at the end, and call the DataFrame constructor only once on the full data set.

Example:

# help from http://stackoverflow.com/a/28463266/3393459
# and http://stackoverflow.com/a/2846697/3393459


from multiprocessing.dummy import Pool as ThreadPool  # thread-based Pool API
import requests
import pandas as pd


pool = ThreadPool(4)  # 4 worker threads

# called by each thread: scrape one URL and return its row as a dict
def get_web_data(url):
    return {'col1': 'something', 'request_data': requests.get(url).text}


urls = ["http://google.com", "http://yahoo.com"]
results = pool.map(get_web_data, urls)  # list of dicts, one per URL
pool.close()
pool.join()


print(results)
print(pd.DataFrame(results))  # build the DataFrame once, from all rows
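If you prefer the standard library's concurrent.futures interface, the same pattern (each thread returns a dict, the DataFrame is built once at the end) looks roughly like this; the column names and URLs below are placeholders, not part of the original answer:

from concurrent.futures import ThreadPoolExecutor
import requests
import pandas as pd

def get_web_data(url):
    # hypothetical scraping step; returns one row as a dict
    return {'url': url, 'request_data': requests.get(url).text}

urls = ["http://google.com", "http://yahoo.com"]

# each worker thread returns its own dict; no shared DataFrame is mutated
with ThreadPoolExecutor(max_workers=4) as executor:
    rows = list(executor.map(get_web_data, urls))

df = pd.DataFrame(rows)  # single DataFrame constructor call at the end
print(df)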
answered Sep 30 '22 by exp1orer