
Apply function slow in dataframe

I have a dataframe that looks like this:

>> df      
  A
0 [{k1:v1, k2:v2}, {k1:v3, k2:v4}]
1 [{k1:v5, k2:v6}, {k1:v7, k2:v8}, {k1:v9, k2:v10}]

that is, column A is a list of dicts with the same keys

and I want to extract the values corresponding to the first dict in those lists:

  K1 K2 A
0 v1 v2 ...
1 v5 v6 ...

my solution so far works but is particularly slow (> 1min for ~50K records):

def extract_first_dict(s):
    s['K1'] = s['A'][0]['k1']
    s['K2'] = s['A'][0]['k2']
    return s
df = df.apply(extract_first_dict, axis = 1)

Could anybody suggest a better, faster way to do this? Thanks!

fricadelle asked Mar 07 '23 04:03



1 Answer

Option 1

You should find pd.Series.apply more efficient than pd.DataFrame.apply, as you are using only one series as an input.

def extract_first(x):
    return list(x[0].values())

df['B'] = df['A'].apply(extract_first)

Option 2

You can also try using a list comprehension:

df['B'] = [list(x[0].values()) for x in df['A']]

In both the above cases, you can split into 2 columns via:

df[['C', 'D']] = df['B'].apply(pd.Series)
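
As a rough sketch of the split step: apply(pd.Series) builds one Series per row, which tends to be slow on larger frames. Assuming every list in B has the same length, constructing a DataFrame from the list of lists may be cheaper. The sample data below is made up to mirror the question, and the values come out in the dicts' insertion order.

```python
import pandas as pd

# Hypothetical sample data shaped like column A in the question.
df = pd.DataFrame({
    "A": [
        [{"k1": "v1", "k2": "v2"}, {"k1": "v3", "k2": "v4"}],
        [{"k1": "v5", "k2": "v6"}, {"k1": "v7", "k2": "v8"}, {"k1": "v9", "k2": "v10"}],
    ]
})

# Option 2 from above: values of the first dict in each list.
df["B"] = [list(x[0].values()) for x in df["A"]]

# Build both columns in one step; index=df.index keeps the rows aligned.
df[["C", "D"]] = pd.DataFrame(df["B"].tolist(), index=df.index)
```

If the dicts might not share the same key order, indexing by key explicitly (x[0]['k1'], x[0]['k2']) is the safer choice.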

You should benchmark with your data to assess whether either of these options is fast enough for your use case.
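
One way to run such a benchmark is with the standard-library timeit module. This is only a sketch: the synthetic ~50K-row frame and the helper names with_apply and with_comprehension are assumptions for illustration, so substitute your real data before drawing conclusions.

```python
import timeit

import pandas as pd

# Hypothetical data: ~50K rows, each a list of two dicts with the same keys.
n = 50_000
df = pd.DataFrame({"A": [[{"k1": i, "k2": -i}, {"k1": 0, "k2": 0}] for i in range(n)]})


def extract_first(x):
    return list(x[0].values())


def with_apply():
    # Option 1: pd.Series.apply
    return df["A"].apply(extract_first)


def with_comprehension():
    # Option 2: list comprehension
    return [list(x[0].values()) for x in df["A"]]


# Time 5 runs of each option; absolute numbers depend on machine and data.
t_apply = timeit.timeit(with_apply, number=5)
t_comp = timeit.timeit(with_comprehension, number=5)
print(f"Series.apply:       {t_apply:.3f} s for 5 runs")
print(f"list comprehension: {t_comp:.3f} s for 5 runs")
```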

But really...

Look upstream to get your data in a more usable format. pandas offers no vectorised functionality for a series of dictionaries, so each element is still processed in a Python-level loop. You should consider using just a list of dictionaries.
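
To illustrate the upstream idea, here is a minimal sketch assuming the raw records are available as plain Python lists before they are packed into a dataframe column. The records variable and the K1/K2 renaming are assumptions based on the question's desired output.

```python
import pandas as pd

# Hypothetical raw input: one list of dicts per record, as in column A.
records = [
    [{"k1": "v1", "k2": "v2"}, {"k1": "v3", "k2": "v4"}],
    [{"k1": "v5", "k2": "v6"}, {"k1": "v7", "k2": "v8"}, {"k1": "v9", "k2": "v10"}],
]

# A flat list of dicts (the first dict of each record) maps directly to columns.
first = pd.DataFrame([rec[0] for rec in records]).rename(
    columns={"k1": "K1", "k2": "K2"}
)
print(first)
```

The same constructor can be pointed at an existing column, for example pd.DataFrame([x[0] for x in df['A']], index=df.index), which selects values by key rather than by position.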

jpp answered Mar 20 '23 03:03
