Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Check reference list in pandas column using numpy vectorization

I have a reference list

ref = ['September', 'August', 'July', 'June', 'May', 'April', 'March']

And a dataframe

df = pd.DataFrame({'Month_List': [['July'], ['August'], ['July', 'June'], ['May', 'April', 'March']]})
df
    Month_List
0   [July]
1   [August]
2   [July, June]
3   [May, April, March]

I want to check which elements from reference list is present in each row, and convert into binary list

I can achieve this using apply

def convert_month_to_binary(ref,lst):
    s = pd.Series(ref)
    return s.isin(lst).astype(int).tolist()  

df['Binary_Month_List'] = df['Month_List'].apply(lambda x: convert_month_to_binary(ref, x))
df

    Month_List          Binary_Month_List
0   [July]              [0, 0, 1, 0, 0, 0, 0]
1   [August]            [0, 1, 0, 0, 0, 0, 0]
2   [July, June]        [0, 0, 1, 1, 0, 0, 0]
3   [May, April, March] [0, 0, 0, 0, 1, 1, 1]

However, using apply on large datasets is very slow and hence I am looking to use numpy vectorization. How can I improve my performance?

Extension:

I wanted to use numpy vectorization because I need to now apply another function on this list

I am trying like this, but performance is very slow. Similar results with apply

def count_one(lst):
    index = [i for i, e in enumerate(lst) if e != 0] 
    return len(index)

vfunc = np.vectorize(count_one)
df['Value'] = vfunc(df['Binary_Month_List']) 
like image 459
Hardik Gupta Avatar asked Sep 27 '19 14:09

Hardik Gupta


3 Answers

We can using explode with get_dummies, notice explode is available after 0.25

df.Month_List.explode().str.get_dummies().sum(level=0).reindex(columns=ref, fill_value=0).values.tolist()
Out[79]: 
[[0, 0, 1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0],
 [0, 0, 1, 1, 0, 0, 0],
 [0, 0, 0, 0, 1, 1, 1]]

#df['new']=df.Month_List.explode().str.get_dummies().sum(level=0).reindex(columns=ref, fill_value=0).values.tolist()
like image 54
BENY Avatar answered Oct 11 '22 14:10

BENY


In pandas is better not use lists this way, but it is possible with MultiLabelBinarizer and DataFrame.reindex for added missing categories, last convert values to numpy array and then to lists if performance is important:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df1 = pd.DataFrame(mlb.fit_transform(df['Month_List']),columns=mlb.classes_)
df['Binary_Month_List'] = df1.reindex(columns=ref, fill_value=0).values.tolist()

Or with Series.str.join, Series.str.get_dummies and reindex:

df['Binary_Month_List'] = (df['Month_List'].str.join('|')
                                           .str.get_dummies()
                                           .reindex(columns=ref, fill_value=0)
                                           .values
                                           .tolist())
print (df)
            Month_List      Binary_Month_List
0               [July]  [0, 0, 1, 0, 0, 0, 0]
1             [August]  [0, 1, 0, 0, 0, 0, 0]
2         [July, June]  [0, 0, 1, 1, 0, 0, 0]
3  [May, April, March]  [0, 0, 0, 0, 1, 1, 1]

Performance is different:

df = pd.concat([df] * 1000, ignore_index=True)

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

In [338]: %timeit (df['Month_List'].str.join('|').str.get_dummies().reindex(columns=ref, fill_value=0).values.tolist())
31.4 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [339]: %timeit pd.DataFrame(mlb.fit_transform(df['Month_List']),columns=mlb.classes_).reindex(columns=ref, fill_value=0).values.tolist()
5.57 ms ± 94.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [340]: %timeit df['Binary_Month_List2'] =df.Month_List.explode().str.get_dummies().sum(level=0).reindex(columns=ref, fill_value=0).values.tolist()
58.6 ms ± 461 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
like image 31
jezrael Avatar answered Oct 11 '22 14:10

jezrael


Here's one with NumPy tools -

def isin_lists(df_col, ref):
    a = np.concatenate(df_col)
    b = np.asarray(ref)

    sidx = b.argsort()
    c = sidx[np.searchsorted(b,a,sorter=sidx)]

    l = np.array([len(i) for i in df_col])
    r = np.repeat(np.arange(len(l)),l)

    out = np.zeros((len(l),len(b)), dtype=bool)
    out[r,c] = 1
    return out.view('i1')

Output for given sample -

In [79]: bin_ar = isin_lists(df['Month_List'], ref)

In [80]: bin_ar
Out[80]: 
array([[0, 0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 1, 1]], dtype=int8)

# To assign as lists for each row into `df`
In [81]: df['Binary_Month_List'] = bin_ar.tolist()

# To get counts
In [82]: df['Value'] = bin_ar.sum(1)

In [83]: df
Out[83]: 
            Month_List      Binary_Month_List  Value
0               [July]  [0, 0, 1, 0, 0, 0, 0]      1
1             [August]  [0, 1, 0, 0, 0, 0, 0]      1
2         [July, June]  [0, 0, 1, 1, 0, 0, 0]      2
3  [May, April, March]  [0, 0, 0, 0, 1, 1, 1]      3

If you can't use the intermediate bin_ar for some reason and have only 'Binary_Month_List' header to work with -

In [15]: df['Value'] = np.vstack(df['Binary_Month_List']).sum(axis=1)
like image 4
Divakar Avatar answered Oct 11 '22 14:10

Divakar