I have a DataFrame with about 1M rows and 3 columns: sentence (a string in the 100-char range), lang (a 3-char string), and i_sent (an int).
I'm trying to generate a new series using a function called compute_coverage, which takes a sentence and its corresponding language, and returns a float:
absolute_coverage = df.apply(lambda x: compute_coverage(x['sentence'], x['lang']),
axis=1)
compute_coverage is a fairly simple function, but generating the series takes a long time (about 50 s). After profiling (results below), it turns out that a large majority of the time is spent in pandas' get_value function, presumably to fetch x['sentence'] and x['lang'].
Am I doing this horribly wrong? Is this expected? Is there a better way to perform a row-wise operation?
Thanks!
Edit:
I guess what I'm getting at is: is there a way to avoid calling get_value()? For instance, if I do
x = df.apply({'sentence': lambda x: compute_coverage(x, 'fra')})
(which obviously returns incorrect results, but performs the same amount of computation), run time drops by 90%.
Function body:
def compute_coverage(sentence, lang):
words = sentence.split()
return len(set(words)) / (lang_vocab[lang] * len(words))
and lang_vocab is an 8-element dictionary.
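For reference, here is a minimal runnable version of the setup, with a hypothetical two-entry lang_vocab and two example sentences (the real dictionary has 8 entries and the real frame ~1M rows):

```python
import pandas as pd

# Hypothetical vocabulary sizes; the real lang_vocab has 8 entries.
lang_vocab = {'fra': 50000, 'eng': 60000}

def compute_coverage(sentence, lang):
    words = sentence.split()
    return len(set(words)) / (lang_vocab[lang] * len(words))

df = pd.DataFrame({
    'sentence': ['le chat dort', 'the cat sleeps on the mat'],
    'lang': ['fra', 'eng'],
    'i_sent': [0, 1],
})

absolute_coverage = df.apply(lambda x: compute_coverage(x['sentence'], x['lang']),
                             axis=1)
```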
120108317 function calls (114648864 primitive calls) in 150.379 seconds
Ordered by: internal time
List reduced from 141 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
2729722 13.090 0.000 83.294 0.000 base.py:2454(get_value)
1 11.105 11.105 150.064 150.064 {pandas._libs.lib.reduce}
1364861 10.287 0.000 16.268 0.000 <ipython-input-16-0ab58d43622d>:3(compute_coverage)
2729722 8.953 0.000 95.187 0.000 series.py:598(__getitem__)
2729722 7.476 0.000 7.476 0.000 {method 'get_value' of 'pandas._libs.index.IndexEngine' objects}
8189190 7.460 0.000 16.088 0.000 {built-in method builtins.getattr}
13648677/8189224 6.484 0.000 9.794 0.000 {built-in method builtins.len}
5459444 6.244 0.000 20.539 0.000 {pandas._libs.lib.values_from_object}
1364864 5.801 0.000 17.845 0.000 series.py:284(_set_axis)
8189277 5.637 0.000 8.747 0.000 {built-in method builtins.isinstance}
The original code calls get_value twice per row, extracting one value each time:
df.apply(lambda x: compute_coverage(x['sentence'], x['lang']),
axis=1)
can be rewritten as
df[['sentence', 'lang']].apply(lambda x: compute_coverage(*x), axis=1)
This is faster because both values are selected in one attempt; the row is then unpacked and passed as parameters to the compute_coverage function.
With the 400,000-row DataFrame built below, the first approach took 7.77 s; for the same data, the second approach took 4.78 s. The second approach is about 40% faster.
df = pd.DataFrame({'a':list('abcd')*100000,
'b':list(range(4))*100000,
'c': list(range(3,7))*100000
})
def f(x, y):
return str(x)+str(y)
df.apply(lambda x: f(x['a'], x['b']), axis=1) took 7.66 s
df[['a', 'b']].apply(lambda x: f(*x), axis=1) took 4.67 s
df.apply(lambda x: f(*x[['a', 'b']]), axis=1) took 1 min 54 s
Running time was measured using %%timeit in a Jupyter notebook (Python 3).
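The two faster forms above are equivalent in output; a small self-contained sketch checking that they agree (timings will of course vary by machine):

```python
import pandas as pd

df = pd.DataFrame({'a': list('abcd') * 5,
                   'b': list(range(4)) * 5})

def f(x, y):
    return str(x) + str(y)

# Two indexed lookups (get_value) per row:
slow = df.apply(lambda x: f(x['a'], x['b']), axis=1)

# One two-column selection per row, then unpacked into f:
fast = df[['a', 'b']].apply(lambda x: f(*x), axis=1)

assert slow.equals(fast)
```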
After looking around, it looks like
x = pd.Series(map(lambda x: compute_coverage(x[0], x[1]),
zip(df.sentence, df.lang)))
takes 9 s, 7 of which are spent inside compute_coverage, so it seems it can't get much better without optimizing that function itself.
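A quick sanity check that the map-over-zip version matches the original apply (again using a hypothetical two-entry lang_vocab and toy data):

```python
import pandas as pd

lang_vocab = {'fra': 50000, 'eng': 60000}  # hypothetical sizes

def compute_coverage(sentence, lang):
    words = sentence.split()
    return len(set(words)) / (lang_vocab[lang] * len(words))

df = pd.DataFrame({'sentence': ['le chat dort', 'the cat sleeps on the mat'],
                   'lang': ['fra', 'eng']})

# Iterates over plain Python objects; no per-row Series construction
# and no pandas indexing inside the loop.
fast = pd.Series(map(lambda x: compute_coverage(x[0], x[1]),
                     zip(df.sentence, df.lang)))

slow = df.apply(lambda x: compute_coverage(x['sentence'], x['lang']), axis=1)
assert (fast == slow).all()
```

The speedup comes from skipping the per-row Series that apply(axis=1) builds; zip walks the two columns directly as Python objects.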
It's probably not the best way to do it, but it works well enough in the meanwhile.