The following reproducible code produces an example data set that mimics my data on a much smaller scale.
import numpy as np
import pandas as pd
np.random.seed(142536)
df = pd.DataFrame({
    "vals": list(np.arange(12).reshape(3, 4)),
    "idx": list(np.random.choice([True, False], 12).reshape(3, 4))})
df
idx vals
0 [False, True, True, False] [0, 1, 2, 3]
1 [True, True, False, True] [4, 5, 6, 7]
2 [False, True, False, False] [8, 9, 10, 11]
The following reproducible code returns the results I want, but is very inefficient for large data sets.
How would I do this more efficiently?
sel = []
for i in range(len(df.vals)):
    sel.append(df.vals[i][df.idx[i]])
df['sel'] = sel
df
idx vals sel
0 [False, True, True, False] [0, 1, 2, 3] [1, 2]
1 [True, True, False, True] [4, 5, 6, 7] [4, 5, 7]
2 [False, True, False, False] [8, 9, 10, 11] [9]
I have tried np.apply_along_axis(), np.where(), df.apply(), and df.transform(), but can't get any of them to work for this case without errors.
The premise is bad because you shouldn't store data like this. You can at least speed this up by flattening your data with itertools.chain, applying the boolean index to the flattened values, and then splitting the result back into rows with np.array_split.
from itertools import chain
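# Flatten a column of arrays into a single 1-D array.
fn = lambda x: np.array(list(chain.from_iterable(x)))
# Index once with the flattened mask, then split at the cumulative count of True values per row.
df['sel'] = np.array_split(
    fn(df.vals)[fn(df.idx)], np.cumsum([sum(x) for x in df.idx][:-1]))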
idx vals sel
0 [True, False, True, False] [0, 1, 2, 3] [0, 2]
1 [False, False, False, True] [4, 5, 6, 7] [7]
2 [False, True, True, False] [8, 9, 10, 11] [9, 10]
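For anyone who wants to check the flatten, index, and split idea on known data, here is a minimal sketch. It assumes the masks printed in the question (hard-coded rather than drawn at random) and swaps in np.concatenate for itertools.chain for the flattening step; the names flat_vals, flat_idx, counts and split_points are only illustrative. It also compares the result against the row-by-row selection, so treat it as a sanity check rather than a benchmark.

```python
import numpy as np
import pandas as pd

# Masks copied from the question's printed output (hard-coded for illustration).
df = pd.DataFrame({
    "vals": list(np.arange(12).reshape(3, 4)),
    "idx": [
        np.array([False, True, True, False]),
        np.array([True, True, False, True]),
        np.array([False, True, False, False]),
    ],
})

# Flatten both columns into 1-D arrays.
flat_vals = np.concatenate(df["vals"].tolist())
flat_idx = np.concatenate(df["idx"].tolist())

# Number of selected values per row, then the positions at which to split.
counts = np.array([mask.sum() for mask in df["idx"]])
split_points = np.cumsum(counts)[:-1]

# One boolean-index operation over all values, then cut back into rows.
df["sel"] = np.array_split(flat_vals[flat_idx], split_points)

# Cross-check against the row-by-row selection from the question.
expected = [v[m] for v, m in zip(df["vals"], df["idx"])]
assert all(np.array_equal(a, b) for a, b in zip(df["sel"], expected))
```

The split points are the cumulative counts of True values with the last one dropped, because np.array_split expects the positions where each new piece starts, not the piece lengths.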