I have a dataframe that I'd like to extend with a new column listing, for each row, the ids of all other rows whose string_value fully contains that row's string_value.
id  string_value
1   The quick brown fox
2   The quick brown fox jumps
3   The quick brown fox jumps over
4   The quick brown fox jumps over the lazy dog
5   The slow
6   The slow brown fox
Desired output:
id  string_value                                 new_columns
1   The quick brown fox                          [2, 3, 4]
2   The quick brown fox jumps                    [3, 4]
3   The quick brown fox jumps over               [4]
4   The quick brown fox jumps over the lazy dog  []
5   The slow                                     [6]
6   The slow brown fox                           []
Thanks
You can't easily vectorize this, but you can use a custom function:
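import pandas as pd

def accumulate(s):
    # s: string_value indexed by id, sorted so a prefix precedes its extensions
    ref = None          # previous string in sorted order
    prev = s.index[0]   # id of the previous string
    out = {}            # id -> ids of the preceding strings it extends
    for i, val in s.items():
        if ref and val.startswith(ref):
            tmp.append(prev)
        else:
            tmp = []
        ref = val
        prev = i
        out[i] = tmp.copy()
    # invert dictionary
    out2 = {}
    for v, l in out.items():
        for k in l:
            out2.setdefault(k, []).append(v)
    return pd.Series(out2)

df['new_columns'] = df['id'].map(accumulate(df.set_index('id')['string_value'].sort_values()))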
Output:
   id  string_value                                 new_columns
0  1   The quick brown fox                          [2, 3, 4]
1  2   The quick brown fox jumps                    [3, 4]
2  3   The quick brown fox jumps over               [4]
3  4   The quick brown fox jumps over the lazy dog  NaN
4  5   The slow                                     [6]
5  6   The slow brown fox                           NaN
To get empty lists in the output instead of NaN, change the "invert dictionary" code to:
    # invert dictionary
    out2 = {i: [] for i in s.index}
    for v, l in out.items():
        for k in l:
            out2[k].append(v)
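Note that the function above compares each string only with its immediate predecessor in sorted order and uses startswith, so it matches prefixes that form a chain, as in the sample data. If your data can branch (for example "a", "ab", "ac") or you need true substring containment, a brute-force comparison may be easier to reason about. The following is only a minimal sketch under those assumptions: containing_ids is a hypothetical helper name, not part of the code above, and it compares every pair of rows, so it is quadratic in the number of rows.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "string_value": [
        "The quick brown fox",
        "The quick brown fox jumps",
        "The quick brown fox jumps over",
        "The quick brown fox jumps over the lazy dog",
        "The slow",
        "The slow brown fox",
    ],
})

def containing_ids(df):
    # Hypothetical helper: for each row, collect the ids of all OTHER rows
    # whose string_value contains this row's string_value as a substring.
    pairs = list(zip(df["id"], df["string_value"]))
    return [
        [other_id for other_id, other in pairs
         if other_id != row_id and text in other]
        for row_id, text in pairs
    ]

df["new_columns"] = containing_ids(df)
print(df)
```

On the sample data this gives the same lists as the desired output, with empty lists rather than NaN. Rows with identical strings would list each other, which may or may not be what you want.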