I am wondering if there is a way to run a pandas DataFrame apply function in parallel. I have looked around and haven't found anything. In theory it should be fairly simple to implement, since applying a function independently to each row is practically the textbook definition of an embarrassingly parallel problem, but I haven't seen an existing solution. Has anyone else tried this or know of a way? If no one has any ideas, I think I might just try writing it myself.
The code I am working with is below. Sorry for the lack of import statements. They are mixed in with a lot of other things.
import nltk  # needed for sent_tokenize, word_tokenize, pos_tag, ne_chunk

def apply_extract_entities(row):
    names=[]
    counter=0
    print row
    for sent in nltk.sent_tokenize(open(row['file_name'], "r+b").read()):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            if hasattr(chunk, 'node'):
                names+= [chunk.node, ' '.join(c[0] for c in chunk.leaves())]
    counter+=1
    print counter
    return names

data9_2['proper_nouns']=data9_2.apply(apply_extract_entities, axis=1)
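In principle each row only depends on its own file, so the shape of what I have in mind is roughly the sketch below: hand the rows out to a pool of worker processes and collect the results. This is just an untested illustration of the idea (the process count of 4 is arbitrary), not something I have working:

import multiprocessing

# rough sketch: each row is independent, so farm the rows out to a pool
# of workers and collect the extracted names back in order
# (on Windows this would need to live under an if __name__ == '__main__' guard)
pool = multiprocessing.Pool(processes=4)
rows = [row for _, row in data9_2.iterrows()]
data9_2['proper_nouns'] = pool.map(apply_extract_entities, rows)
pool.close()
pool.join()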
EDIT:
So here is what I tried. I ran it with just the first five elements of my iterable, and it is taking longer than it would serially, so I assume it is not working.
import os
import itertools
import pandas as pd
import nltk

# home (the working directory prefix) is defined earlier in the script
os.chdir(str(home))
data9_2=pd.read_csv('edgarsdc3.csv')
os.chdir(str(home)+str('//defmtest'))

#import stuff
from nltk import pos_tag, ne_chunk
from nltk.tokenize import SpaceTokenizer

#define apply function and apply it
os.chdir(str(home)+str('//defmtest'))

####
#this is our apply function
def apply_extract_entities(row):
    names=[]
    counter=0
    print row
    for sent in nltk.sent_tokenize(open(row['file_name'], "r+b").read()):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            if hasattr(chunk, 'node'):
                names+= [chunk.node, ' '.join(c[0] for c in chunk.leaves())]
    counter+=1
    print counter
    return names

#need something that populates a list of sections of a dataframe
def dataframe_splitter(df):
    df_list=range(len(df))  # pre-allocate a list the same length as the frame
    for i in xrange(len(df)):
        sliced=df.ix[i]  # note: df.ix[i] returns a single row as a Series
        df_list[i]=sliced
    return df_list

df_list=dataframe_splitter(data9_2)
#df_list=range(len(data9_2))
print df_list

#the multiprocessing section
import multiprocessing

def worker(arg):
    print arg
    arg['proper_nouns']=arg.apply(apply_extract_entities, axis=1)
    return arg

pool = multiprocessing.Pool(processes=10)

# get list of pieces
res = pool.imap_unordered(worker, df_list[:5])
res2= list(itertools.chain(*res))

pool.close()
pool.join()

# re-assemble pieces into the final output
output = data9_2.head(1).concatenate(res)
print output.head()
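One other thing I am unsure about: DataFrame does not actually have a concatenate method, so presumably the re-assembly step at the end should be something more like pd.concat. An untested sketch of that line, assuming each element of res comes back as a piece of the frame rather than a plain list:

import pandas as pd

# re-assemble the pieces returned by the workers into one frame
# (assumes each element of res is a DataFrame/Series piece)
output = pd.concat(list(res))
print output.head()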
With multiprocessing, it's best to generate several large blocks of data, process them in parallel, then re-assemble them to produce the final output. Handing out tiny units of work (such as single rows) means the pickling and inter-process overhead can easily outweigh any speedup, which is likely why the five-element test runs slower than the serial version. A minimal example of the split/process/re-assemble pattern:
import multiprocessing

def worker(arg):
    return arg*2

pool = multiprocessing.Pool()

# get list of pieces
res = pool.map(worker, [1,2,3])
pool.close()
pool.join()

# re-assemble pieces into the final output
output = sum(res)
print 'got:',output
got: 12
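The same idea carries over to the DataFrame case: split the frame into a handful of large pieces, run the ordinary apply on each piece inside a worker, and concatenate the pieces at the end. A minimal sketch of that pattern, assuming numpy.array_split is used for the chunking and that apply_extract_entities is defined at module level so the workers can pickle and import it (parallel_apply and n_chunks are made-up names for illustration):

import multiprocessing
import numpy as np
import pandas as pd

def process_chunk(chunk):
    # run the ordinary serial apply on one large piece of the frame
    return chunk.apply(apply_extract_entities, axis=1)

def parallel_apply(df, n_chunks=4):
    # split the frame into a few large pieces, process each piece in its
    # own worker, then stitch the results back together in order
    chunks = np.array_split(df, n_chunks)
    pool = multiprocessing.Pool(processes=n_chunks)
    results = pool.map(process_chunk, chunks)
    pool.close()
    pool.join()
    return pd.concat(results)

# usage: data9_2['proper_nouns'] = parallel_apply(data9_2, n_chunks=4)

Because each worker receives a sizeable chunk rather than a single row, the per-task overhead is amortized over many rows, which is the point of the toy example above.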