
Get a list of rows starting from the same value as current row in pandas dataframe

I have a dataframe that I'd like to extend with a new column containing, for each row, the list of ids of all other rows whose string_value starts with that row's string_value.

id  string_value
1   The quick brown fox 
2   The quick brown fox jumps  
3   The quick brown fox jumps over 
4   The quick brown fox jumps over the lazy dog
5   The slow 
6   The slow brown fox 

Desired output

id  string_value                                new_columns
1   The quick brown fox                         [2, 3, 4]
2   The quick brown fox jumps                   [3, 4]
3   The quick brown fox jumps over              [4]
4   The quick brown fox jumps over the lazy dog []
5   The slow                                    [6]
6   The slow brown fox                          []
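The rule behind the desired output can be written down as a brute-force check: for each row, collect the ids of the other rows whose string starts with that row's string. Below is a minimal sketch in plain Python, assuming the trailing spaces in the table are not significant. The names `rows` and `prefix_matches` are illustrative only, and the approach is quadratic, so it is meant as a reference for small data rather than a solution.

```python
# Sample data from the table above, keyed by id (trailing spaces dropped).
rows = {
    1: "The quick brown fox",
    2: "The quick brown fox jumps",
    3: "The quick brown fox jumps over",
    4: "The quick brown fox jumps over the lazy dog",
    5: "The slow",
    6: "The slow brown fox",
}


def prefix_matches(rows):
    """For each id, list the other ids whose string starts with this id's string."""
    return {
        i: [j for j, other in rows.items() if j != i and other.startswith(text)]
        for i, text in rows.items()
    }


expected = prefix_matches(rows)
print(expected)
# {1: [2, 3, 4], 2: [3, 4], 3: [4], 4: [], 5: [6], 6: []}
```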

Thanks

Nadiia asked Jan 21 '26 20:01


1 Answer

You can't easily vectorize this, but you can sort the values and walk through them with a custom function:

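import pandas as pd

def accumulate(s):
    # s: string values indexed by id, sorted so that longer strings follow their prefix
    ref = None    # previous value in sorted order
    prev = None   # id of the previous value
    tmp = []      # ids of the chain of prefixes leading to the current value
    out = {}
    for i, val in s.items():
        if ref and val.startswith(ref):
            # current value extends the previous one: grow the chain
            tmp.append(prev)
        else:
            # chain is broken: start a new one
            tmp = []
        ref = val
        prev = i
        out[i] = tmp.copy()

    # invert dictionary (id -> ids of the longer rows that start with it)
    out2 = {}
    for v,l in out.items():
        for k in l:
            out2.setdefault(k, []).append(v)
    
    return pd.Series(out2)

df['new_columns'] = df['id'].map(accumulate(df.set_index('id')['string_value'].sort_values()))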

Output:

   id                                 string_value new_columns
0   1                          The quick brown fox   [2, 3, 4]
1   2                    The quick brown fox jumps      [3, 4]
2   3               The quick brown fox jumps over         [4]
3   4  The quick brown fox jumps over the lazy dog         NaN
4   5                                     The slow         [6]
5   6                           The slow brown fox         NaN

Empty lists

Ids that no other row extends (4 and 6 here) never become keys of out2, so map leaves them as NaN. To have empty lists in the output in place of NaN, change the "invert dictionary" code to:

    # invert dictionary
    out2 = {i: [] for i in s.index}
    for v,l in out.items():
        for k in l:
            out2[k].append(v)
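
One caveat: the loop in accumulate compares each value only with the one immediately before it in sorted order. That is enough for the sample data, where each group forms a single chain, but a prefix shared by several diverging strings (say "The quick", "The quick brown", "The quick red") would lose the later matches. A possible variant keeps a stack of the prefixes that are still "open" instead. The sketch below is plain Python on a dict of id to string; `prefix_groups` is a hypothetical name, not part of the answer above, and the sorted-order argument assumes ordinary string comparison.

```python
def prefix_groups(rows):
    """Map each id to the ids of the other rows whose string starts with its string."""
    out = {i: [] for i in rows}   # every id gets a list, possibly empty
    stack = []                    # (id, text) pairs that are prefixes of the current text
    # In sorted order, all strings sharing a prefix sit in one contiguous run
    # right after that prefix, so a stack is enough to track open prefixes.
    for i, text in sorted(rows.items(), key=lambda kv: kv[1]):
        # Drop prefixes that the current text no longer extends.
        while stack and not text.startswith(stack[-1][1]):
            stack.pop()
        # Everything left on the stack is a prefix of the current text.
        for k, _ in stack:
            out[k].append(i)
        stack.append((i, text))
    return out


# Branching example that the chain-based loop would miss.
rows = {1: "The quick", 2: "The quick brown", 3: "The quick red", 4: "The slow"}
result = prefix_groups(rows)
print(result)
# {1: [2, 3], 2: [], 3: [], 4: []}
```

With the dataframe from the question, the result could be attached with something like df['id'].map(prefix_groups(dict(zip(df['id'], df['string_value'])))), since Series.map accepts a dict. Identical strings are matched in one direction only (the later one in sorted order is listed under the earlier one).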

mozway answered Jan 23 '26 19:01


