
Pandas dataframe - Performance of get_value in apply

Tags: python, pandas

I have a dataframe with about 1M rows and 3 columns: sentence (a string in the 100-char range), lang (a 3-char string), and i_sent (an int).

I'm trying to generate a new series using a function called compute_coverage, which takes a sentence and its corresponding language and returns a float:

absolute_coverage = df.apply(lambda x: compute_coverage(x['sentence'], x['lang']),
                             axis=1)

compute_coverage is a fairly simple function, but generating the series takes a long time (about 50s). After profiling (results below), it turns out that a large majority of the time is spent in pandas' get_value function, presumably to get x['sentence'] and x['lang'].

Am I doing this horribly wrong? Is this expected? Is there a better way to perform a row-wise operation?

Thanks!


Edit:

I guess what I'm getting at is: is there a way to avoid calling get_value()? For instance, if I do

x = df.apply({'sentence': lambda x: compute_coverage(x, 'fra')})

(which obviously returns incorrect results, but performs the same amount of computation), run time drops by 90%.

Function body:

def compute_coverage(sentence, lang):
    words = sentence.split()
    return len(set(words)) / (lang_vocab[lang] * len(words))

and lang_vocab is an 8-element dictionary.
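
For reference, a profile in this format (sorted by internal time and cut to the top 10 entries) can be produced with the standard cProfile and pstats modules; a minimal sketch:

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
absolute_coverage = df.apply(lambda x: compute_coverage(x['sentence'], x['lang']),
                             axis=1)
profiler.disable()

# Sort by internal time and print the 10 most expensive entries,
# which is the shape of the dump below.
pstats.Stats(profiler).sort_stats('tottime').print_stats(10)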


         120108317 function calls (114648864 primitive calls) in 150.379 seconds

   Ordered by: internal time
   List reduced from 141 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  2729722   13.090    0.000   83.294    0.000 base.py:2454(get_value)
        1   11.105   11.105  150.064  150.064 {pandas._libs.lib.reduce}
  1364861   10.287    0.000   16.268    0.000 <ipython-input-16-0ab58d43622d>:3(compute_coverage)
  2729722    8.953    0.000   95.187    0.000 series.py:598(__getitem__)
  2729722    7.476    0.000    7.476    0.000 {method 'get_value' of 'pandas._libs.index.IndexEngine' objects}
  8189190    7.460    0.000   16.088    0.000 {built-in method builtins.getattr}
13648677/8189224    6.484    0.000    9.794    0.000 {built-in method builtins.len}
  5459444    6.244    0.000   20.539    0.000 {pandas._libs.lib.values_from_object}
  1364864    5.801    0.000   17.845    0.000 series.py:284(_set_axis)
  8189277    5.637    0.000    8.747    0.000 {built-in method builtins.isinstance}
asked Jan 27 '26 by zale

2 Answers

This extracts values via get_value twice per row, one value at a time:

df.apply(lambda x: compute_coverage(x['sentence'], x['lang']),
         axis=1)

can be rewritten as

df[['sentence', 'lang']].apply(lambda x: compute_coverage(*x), axis=1)

It is faster because both values are selected in one step; the two-element row is then unpacked and passed as parameters to the compute_coverage function.

With a test data frame, the first approach took 7.77 s and, on the same data, the second took 4.78 s, so the second approach is roughly 40% faster.


Using this test data frame:
df = pd.DataFrame({'a':list('abcd')*100000, 
                   'b':list(range(4))*100000, 
                   'c': list(range(3,7))*100000
                  })
def f(x, y):
    return str(x)+str(y)

df.apply(lambda x: f(x['a'], x['b']), axis=1)      # 7.66 s
df[['a', 'b']].apply(lambda x: f(*x), axis=1)      # 4.67 s
df.apply(lambda x: f(*x[['a', 'b']]), axis=1)      # 1min 54s

Run times were measured with %%timeit in a Jupyter notebook (Python 3).
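
One more knob that may be worth trying (not timed here, so treat it as a sketch): DataFrame.apply accepts raw=True, which passes each row to the function as a plain NumPy array instead of building a Series per row; that per-row Series construction is part of what shows up as series.py _set_axis in the question's profile.

# Sketch only: raw=True hands the function an ndarray per row
# rather than constructing a Series for each row.
df[['a', 'b']].apply(lambda x: f(*x), axis=1, raw=True)
# or, for the original problem:
df[['sentence', 'lang']].apply(lambda x: compute_coverage(*x), axis=1, raw=True)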

answered Jan 29 '26 by shanmuga


After looking around, it turns out that

x = pd.Series(map(lambda x: compute_coverage(x[0], x[1]),
                  zip(df.sentence, df.lang)))

takes 9s, 7 of which are spent inside compute_coverage, so it looks like it can't get much better without optimizing that function.

It's probably not the best way to do it, but it works well enough in the meantime.
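
If more speed is ever needed, one possible refinement (a sketch, not timed) is to keep only the per-sentence work in the Python loop and let pandas handle the per-language divisor, since compute_coverage factors into (unique words / total words) divided by lang_vocab[lang]:

import pandas as pd

def unique_word_ratio(sentence):
    # The sentence-only part of compute_coverage.
    words = sentence.split()
    return len(set(words)) / len(words)

ratio = pd.Series([unique_word_ratio(s) for s in df['sentence']], index=df.index)
# Vectorized per-language divisor; index alignment keeps rows matched.
absolute_coverage = ratio / df['lang'].map(lang_vocab)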

answered Jan 29 '26 by zale