
Vectorized indexing numpy arrays in pandas Series with Boolean numpy arrays in pandas Series

The following reproducible code produces an example data set that mimics my data on a much smaller scale.

import numpy as np 
import pandas as pd

np.random.seed(142536)

df = pd.DataFrame({
        "vals": list(np.arange(12).reshape(3,4)),
        "idx" : list(np.random.choice([True, False], 12).reshape(3,4))})
df

                           idx            vals
0   [False, True, True, False]    [0, 1, 2, 3]
1    [True, True, False, True]    [4, 5, 6, 7] 
2  [False, True, False, False]  [8, 9, 10, 11] 

The following reproducible code returns the results I want, but is very inefficient for large data sets.
How would I do this more efficiently?

sel = []
for i in range(len(df.vals)):
    sel.append(df.vals[i][df.idx[i]])

df['sel'] = sel
df

                           idx            vals        sel
0   [False, True, True, False]    [0, 1, 2, 3]     [1, 2]
1    [True, True, False, True]    [4, 5, 6, 7]  [4, 5, 7]
2  [False, True, False, False]  [8, 9, 10, 11]        [9]

I have tried np.apply_along_axis(), np.where(), df.apply(), and df.transform(), but can't get any of them to work for this case without errors.

Clay asked Jan 28 '23 14:01



1 Answer

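The premise is flawed: you shouldn't store arrays inside Series cells, because object columns like these cannot be indexed in a vectorized way. You can at least speed this up by flattening both columns with itertools.chain, applying the Boolean mask once to the flattened values, and then splitting the result back into per-row pieces with np.array_split.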

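from itertools import chain

# Flatten a Series of arrays into a single 1-D array.
fn = lambda x: np.array(list(chain.from_iterable(x)))
# Mask the flattened values once, then split at the running count of
# True entries per row (the last count is dropped: it marks the end).
df['sel'] = np.array_split(
    fn(df.vals)[fn(df.idx)], np.cumsum([sum(x) for x in df.idx][:-1]))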

                           idx            vals      sel
0   [True, False, True, False]    [0, 1, 2, 3]   [0, 2]
1  [False, False, False, True]    [4, 5, 6, 7]      [7]
2   [False, True, True, False]  [8, 9, 10, 11]  [9, 10]
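
For readers who want to check the flatten, mask, and split steps one at a time, here is a minimal sketch rather than a definitive implementation. It assumes the masks printed in the question (hard-coded here instead of drawn at random) and swaps np.concatenate in for itertools.chain; the names flat_vals, flat_idx, counts, and split_points are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame shaped like the question's data: every cell holds a
# NumPy array, and "idx" holds the Boolean masks shown in the question.
df = pd.DataFrame({
    "vals": list(np.arange(12).reshape(3, 4)),
    "idx": list(np.array([
        [False, True, True, False],
        [True, True, False, True],
        [False, True, False, False],
    ])),
})

# 1. Flatten both columns into single 1-D arrays
#    (np.concatenate used here in place of itertools.chain).
flat_vals = np.concatenate(df["vals"].tolist())
flat_idx = np.concatenate(df["idx"].tolist())

# 2. Apply the Boolean mask once over all values.
selected = flat_vals[flat_idx]

# 3. Count the True entries per row; the running total of all counts
#    except the last gives the positions at which to split.
counts = [int(mask.sum()) for mask in df["idx"]]
split_points = np.cumsum(counts[:-1])

# 4. Split the selected values back into one array per row.
df["sel"] = np.array_split(selected, split_points)
```

On this input the counts are 2, 3, and 1, so the split points are 2 and 5 and the rows come back as [1, 2], [4, 5, 7], and [9], matching the loop in the question. Whether this is actually faster than the loop depends on the number of rows and the array sizes, so it is worth timing on real data.
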
cs95 answered Jan 31 '23 22:01
