Rolling sum with strings

Tags: python, pandas

Say I have a dataframe containing strings, such as:

import pandas as pd
df = pd.DataFrame({'col1': list('some_string')})

    col1
0     s
1     o    
2     m
3     e
4     _
5     s
...

I'm looking for a way to apply a rolling window to col1 and join the strings within each window. For instance, with window=3 and no minimum number of observations, I'd like to obtain:

     col1
0     s
1     so
2     som
3     ome
4     me_
5     e_s
6     _st
7     str
8     tri
9     rin
10    ing

I've tried the obvious solutions with rolling, but they fail because rolling cannot handle object dtypes:

df.col1.rolling(3, min_periods=0).sum()
df.col1.rolling(3, min_periods=0).apply(''.join)

Both raise:

cannot handle this type -> object

Is there a generalisable approach to do so (not using shift to match this specific case of w=3)?

yatu asked Jun 11 '19 10:06


3 Answers

How about shifting the series?

df.col1.shift(2).fillna('') + df.col1.shift().fillna('') + df.col1

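Generalizing to any window size (3 here; the largest shift has to come first so that the characters stay in their original order):

pd.concat([df.col1.shift(i).fillna('') for i in reversed(range(3))], axis=1).sum(axis=1)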
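
As a rough sketch of the same shift idea with the window size as a parameter, the shifted series can also be concatenated with `functools.reduce` instead of a row-wise `sum`. The function name `rolling_join_shift` is made up for illustration and is not part of pandas; the largest shift is applied first so the oldest character ends up leftmost.

```python
from functools import reduce
import operator

import pandas as pd


def rolling_join_shift(s: pd.Series, window: int) -> pd.Series:
    """Join each value with up to `window - 1` preceding values using shifts.

    Hypothetical helper for illustration only.
    """
    # Largest shift first, so older characters are placed to the left.
    shifted = [s.shift(i).fillna('') for i in range(window - 1, -1, -1)]
    # Element-wise string concatenation of the shifted series.
    return reduce(operator.add, shifted)


df = pd.DataFrame({'col1': list('some_string')})
result = rolling_join_shift(df.col1, 3)
print(result.tolist())
```

This builds `window` shifted copies of the column, so it is probably best suited to small windows.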

IanS answered Oct 02 '22 22:10


Rolling works only with numbers:

def _prep_values(self, values=None, kill_inf=True):
        if values is None:
            values = getattr(self._selected_obj, 'values', self._selected_obj)
        # GH #12373 : rolling functions error on float32 data
        # make sure the data is coerced to float64
        if is_float_dtype(values.dtype):
            values = ensure_float64(values)
        elif is_integer_dtype(values.dtype):
            values = ensure_float64(values)
        elif needs_i8_conversion(values.dtype):
            raise NotImplementedError...
    ...
    ...

So you have to construct it manually. Here is one possible variant using a simple list comprehension (there may be a more pandas-idiomatic way):

import pandas as pd

df = pd.DataFrame({'col1': list('some_string')})
# For each position i, join the (up to) 3 values ending at i
pd.Series([
    ''.join(df.col1.values[max(i-2, 0): i+1])
    for i in range(len(df.col1.values))
])
0       s
1      so
2     som
3     ome
4     me_
5     e_s
6     _st
7     str
8     tri
9     rin
10    ing
dtype: object
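
The list comprehension above hard-codes a window of 3 through `i-2`. A minimal sketch with the window size as a parameter might look like the following; the function name `rolling_join` is hypothetical, and keeping the original index and name is my own choice rather than something the answer specifies.

```python
import pandas as pd


def rolling_join(s: pd.Series, window: int) -> pd.Series:
    """Join the (up to) `window` values ending at each position.

    Hypothetical helper for illustration only.
    """
    values = s.tolist()
    joined = [
        # The slice starts at 0 until a full window is available.
        ''.join(values[max(i - window + 1, 0): i + 1])
        for i in range(len(values))
    ]
    # Assumption: preserve the input's index and name on the result.
    return pd.Series(joined, index=s.index, name=s.name)


df = pd.DataFrame({'col1': list('some_string')})
result = rolling_join(df.col1, 4)
print(result.tolist())
```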

vurmux answered Oct 02 '22 23:10


Using pd.Series.cumsum seems to work (although it is somewhat inefficient, since every row holds the full accumulated string up to that point before it is sliced):

df['col1'].cumsum().str[-3:]

Output:

0       s
1      so
2     som
3     ome
4     me_
5     e_s
6     _st
7     str
8     tri
9     rin
10    ing
Name: col1, dtype: object
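
A small sketch of the cumsum approach with the window as a parameter, which also counts how many characters the intermediate series holds to illustrate the inefficiency. The function name `rolling_join_cumsum` is made up, and the `astype(object)` cast is a defensive assumption on my part so the sketch does not depend on how a dedicated string dtype handles `cumsum`.

```python
import pandas as pd


def rolling_join_cumsum(s: pd.Series, window: int) -> pd.Series:
    """Accumulate all strings, then keep the last `window` characters.

    Hypothetical helper for illustration only; assumes window >= 1 and
    that every element is a single character.
    """
    # Assumption: work on object dtype so cumsum concatenates Python strings.
    accumulated = s.astype(object).cumsum()
    return accumulated.str[-window:]


df = pd.DataFrame({'col1': list('some_string')})
result = rolling_join_cumsum(df.col1, 3)

# Row i of the intermediate series holds i + 1 characters, so the total
# grows quadratically with the number of rows.
total_chars = df.col1.astype(object).cumsum().str.len().sum()
print(result.tolist(), total_chars)
```

Note that slicing by character count only matches a rolling join when each row holds exactly one character, as in the example data.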

Chris answered Oct 02 '22 22:10