Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient python data transformation with pandas/numpy

I got df like this:

cols=['a', 'b']
df = pd.DataFrame([[[3,1,5,7], [42,31]], [[],[44]], [[44,3,5,5,5,10],[]], [[], [44324,3]]],
                   columns=cols)

As you see theres list in every cell. I want to to followings things on each of element:

  1. Calculate mean of list and append 5
  2. If result <= 0, add 1 in place of list
  3. If list is empty, add 0 in place of list

My working solution:

df
def convert_list(x):
    if len(x) != 0:
        res = (sum(x)/len(x)) + 5
        if res <= 0:
            res = 1
        return res
    return 0

for col in cols:
    df[col] = df[col].apply(lambda x: convert_list(x))

Desired output:

df

It's working but its very slow solution (in original df I got about 50k columns and 100k rows, and list might contains many elements). Is there any efficient solution for this? I also tried convert it to numpy array and do some vectecorized operations, but the problem is every list might have different length, so I cant convert it (unless I add many elements to other lists...)

like image 767
Thomas Kodill Avatar asked Apr 13 '26 21:04

Thomas Kodill


2 Answers

You could use applymap and np.mean to average every cell and add 5. Then any value below 5 would have been a negative mean, and nans can be filled with zero.

import pandas as pd
import numpy as np
cols=['a', 'b']
df = pd.DataFrame([[[3,1,5,7], [42,31]], [[],[44]], [[44,3,5,5,5,10],[]], [[], [44324,3]]],
                   columns=cols)


df = (df.applymap(np.mean)+5)
df[df<5]=1
df = df.fillna(0)

Output

      a        b
0   9.0     41.5
1   0.0     49.0
2  17.0      0.0
3   0.0  22168.5
like image 106
Chris Avatar answered Apr 16 '26 09:04

Chris


You could try using explode then an aggregation, with the idea of avoiding apply.

Something like that, but it is ugly.

def correct_column(col):
    cole = col.explode()
    m_empty = cole.isna().groupby(cole.index).first()
    groups = cole.groupby(cole.index)
    col_sum = groups.sum()
    col_count = groups.count()
    col = groups.sum() / groups.count() + 5
    m_negative = col <= 0
    col[m_empty] = 0
    col[m_negative] = 1
    return col

for col in df.columns:
    df[col] = correct_column(df[col])

Note: groups.sum() / groups.count() may be strange way to get the mean but for empty lists (NaN after explode) mean raises an error when sum returns 0.

like image 39
hpchavaz Avatar answered Apr 16 '26 10:04

hpchavaz



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!