I have a DataFrame with about 1M rows and 3 columns: sentence (a string in the 100-char range), lang (a 3-char string), and i_sent (an int).
I'm trying to generate a new series using a function called compute_coverage, which takes a sentence and its corresponding language, and returns a float:
absolute_coverage = df.apply(lambda x: compute_coverage(x['sentence'], x['lang']),
axis=1)
compute_coverage is a fairly simple function, but generating the series takes a long time (about 50 s). After profiling (results below), it turns out that a large majority of the time is spent in pandas' get_value function, presumably to fetch x['sentence'] and x['lang'].
Am I doing this horribly wrong? Is this expected? Is there a better way to perform a row-wise operation?
Thanks!
Edit:
I guess what I'm getting at is: is there a way to avoid calling get_value()? For instance, if I do
x = df.apply({'sentence': lambda x: compute_coverage(x, 'fra')})
(which obviously returns incorrect results, but performs the same amount of computation), run time drops by 90%.
Function body:
def compute_coverage(sentence, lang):
words = sentence.split()
return len(set(words)) / (lang_vocab[lang] * len(words))
and lang_vocab is an 8-element dictionary.
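For reference, here is a minimal runnable version of the setup, with a hypothetical two-entry lang_vocab and two example sentences (the real dictionary has 8 entries and the real frame ~1M rows):

```python
import pandas as pd

# Hypothetical vocabulary sizes; the real lang_vocab has 8 entries.
lang_vocab = {'fra': 50000, 'eng': 60000}

def compute_coverage(sentence, lang):
    words = sentence.split()
    return len(set(words)) / (lang_vocab[lang] * len(words))

df = pd.DataFrame({
    'sentence': ['le chat dort', 'the cat sleeps on the mat'],
    'lang': ['fra', 'eng'],
    'i_sent': [0, 1],
})

absolute_coverage = df.apply(lambda x: compute_coverage(x['sentence'], x['lang']),
                             axis=1)
```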
120108317 function calls (114648864 primitive calls) in 150.379 seconds
Ordered by: internal time
List reduced from 141 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
2729722 13.090 0.000 83.294 0.000 base.py:2454(get_value)
1 11.105 11.105 150.064 150.064 {pandas._libs.lib.reduce}
1364861 10.287 0.000 16.268 0.000 <ipython-input-16-0ab58d43622d>:3(compute_coverage)
2729722 8.953 0.000 95.187 0.000 series.py:598(__getitem__)
2729722 7.476 0.000 7.476 0.000 {method 'get_value' of 'pandas._libs.index.IndexEngine' objects}
8189190 7.460 0.000 16.088 0.000 {built-in method builtins.getattr}
13648677/8189224 6.484 0.000 9.794 0.000 {built-in method builtins.len}
5459444 6.244 0.000 20.539 0.000 {pandas._libs.lib.values_from_object}
1364864 5.801 0.000 17.845 0.000 series.py:284(_set_axis)
8189277 5.637 0.000 8.747 0.000 {built-in method builtins.isinstance}
The original code calls get_value twice per row, extracting one value each time:
df.apply(lambda x: compute_coverage(x['sentence'], x['lang']),
axis=1)
can be rewritten as
df[['sentence', 'lang']].apply(lambda x: compute_coverage(*x), axis=1)
This is faster because both values are selected in one attempt; the row is then unpacked and passed as parameters to the compute_coverage function.
With the 400,000-row DataFrame built below, the first approach took 7.77 s; for the same data, the second approach took 4.78 s. The second approach is about 40% faster.
df = pd.DataFrame({'a':list('abcd')*100000,
'b':list(range(4))*100000,
'c': list(range(3,7))*100000
})
def f(x, y):
return str(x)+str(y)
df.apply(lambda x: f(x['a'], x['b']), axis=1) took 7.66 s
df[['a', 'b']].apply(lambda x: f(*x), axis=1) took 4.67 s
df.apply(lambda x: f(*x[['a', 'b']]), axis=1) took 1 min 54 s
Running time was measured using %%timeit in a Jupyter notebook (Python 3).
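The two faster forms above are equivalent in output; a small self-contained sketch checking that they agree (timings will of course vary by machine):

```python
import pandas as pd

df = pd.DataFrame({'a': list('abcd') * 5,
                   'b': list(range(4)) * 5})

def f(x, y):
    return str(x) + str(y)

# Two indexed lookups (get_value) per row:
slow = df.apply(lambda x: f(x['a'], x['b']), axis=1)

# One two-column selection per row, then unpacked into f:
fast = df[['a', 'b']].apply(lambda x: f(*x), axis=1)

assert slow.equals(fast)
```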
After looking around, it looks like
x = pd.Series(map(lambda x: compute_coverage(x[0], x[1]),
zip(df.sentence, df.lang)))
takes 9 s, 7 of which are spent inside compute_coverage, so it seems it can't get much better without optimizing that function itself.
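A quick sanity check that the map-over-zip version matches the original apply (again using a hypothetical two-entry lang_vocab and toy data):

```python
import pandas as pd

lang_vocab = {'fra': 50000, 'eng': 60000}  # hypothetical sizes

def compute_coverage(sentence, lang):
    words = sentence.split()
    return len(set(words)) / (lang_vocab[lang] * len(words))

df = pd.DataFrame({'sentence': ['le chat dort', 'the cat sleeps on the mat'],
                   'lang': ['fra', 'eng']})

# Iterates over plain Python objects; no per-row Series construction
# and no pandas indexing inside the loop.
fast = pd.Series(map(lambda x: compute_coverage(x[0], x[1]),
                     zip(df.sentence, df.lang)))

slow = df.apply(lambda x: compute_coverage(x['sentence'], x['lang']), axis=1)
assert (fast == slow).all()
```

The speedup comes from skipping the per-row Series that apply(axis=1) builds; zip walks the two columns directly as Python objects.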
It's probably not the best way to do it, but it works well enough in the meanwhile.