Pandas: merge consecutive duplicate rows

Tags: python, pandas

I have a dataframe:

ID     url            date         active_seconds
111    vk.com         12.01.2016   5
111    facebook.com   12.01.2016   4
111    facebook.com   12.01.2016   3
111    twitter.com    12.01.2016   12
222    vk.com         12.01.2016   8
222    twitter.com    12.01.2016   34
111    facebook.com   12.01.2016   5

and I need to get:

ID     url            date         active_seconds
111    vk.com         12.01.2016   5
111    facebook.com   12.01.2016   7
111    twitter.com    12.01.2016   12
222    vk.com         12.01.2016   8
222    twitter.com    12.01.2016   34
111    facebook.com   12.01.2016   5

If I try

df.groupby(['ID', 'url'])['active_seconds'].sum()

it merges all rows with the same ID and url, so the last facebook.com visit is folded into the first facebook.com block. How can I sum active_seconds over consecutive duplicate rows only?
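
For reference, a minimal reproduction (the frame rebuilt from the values above) of what the plain groupby returns:

import pandas as pd

# rebuild the sample frame from the question
df = pd.DataFrame({
    'ID': [111, 111, 111, 111, 222, 222, 111],
    'url': ['vk.com', 'facebook.com', 'facebook.com', 'twitter.com',
            'vk.com', 'twitter.com', 'facebook.com'],
    'date': ['12.01.2016'] * 7,
    'active_seconds': [5, 4, 3, 12, 8, 34, 5],
})

print(df.groupby(['ID', 'url'])['active_seconds'].sum())
# ID   url
# 111  facebook.com    12   <- 4 + 3 + 5: both facebook.com blocks merged
#      twitter.com     12
#      vk.com           5
# 222  twitter.com     34
#      vk.com           8
# Name: active_seconds, dtype: int64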

asked Jan 13 '17 by Petr Petrov

3 Answers

Solution 1 - cumsum by column url only:

You need to group by a helper Series created with cumsum over a boolean mask, and the url column then has to be aggregated with first. Afterwards drop the url level with reset_index and finally restore the original column order with reindex:

g = (df.url != df.url.shift()).cumsum()
print (g)
0    1
1    2
2    2
3    3
4    4
5    5
6    6
Name: url, dtype: int32

g = (df.url != df.url.shift()).cumsum()
#another solution with ne
#g =  df.url.ne(df.url.shift()).cumsum()

print (df.groupby([df.ID,df.date,g], sort=False).agg({'active_seconds':'sum', 'url':'first'})
         .reset_index(level='url', drop=True)
         .reset_index()
         .reindex(columns=df.columns))

    ID           url        date  active_seconds
0  111        vk.com  12.01.2016               5
1  111  facebook.com  12.01.2016               7
2  111   twitter.com  12.01.2016              12
3  222        vk.com  12.01.2016               8
4  222   twitter.com  12.01.2016              34
5  111  facebook.com  12.01.2016               5

A variant names the helper Series with rename, so the extra level can be dropped by name after grouping on the original columns plus the helper:

g = (df.url != df.url.shift()).cumsum().rename('tmp')
print (g)
0    1
1    2
2    2
3    3
4    4
5    5
6    6
Name: tmp, dtype: int32

print (df.groupby([df.ID, df.url, df.date, g], sort=False)['active_seconds']
         .sum()
         .reset_index(level='tmp', drop=True)
         .reset_index())

    ID           url        date  active_seconds
0  111        vk.com  12.01.2016               5
1  111  facebook.com  12.01.2016               7
2  111   twitter.com  12.01.2016              12
3  222        vk.com  12.01.2016               8
4  222   twitter.com  12.01.2016              34
5  111  facebook.com  12.01.2016               5

Solution 2 - cumsum by columns ID and url:

g =  df[['ID','url']].ne(df[['ID','url']].shift()).cumsum()
print (g)
   ID  url
0   1    1
1   1    2
2   1    2
3   1    3
4   2    4
5   2    5
6   3    6

print (df.groupby([g.ID, df.date, g.url], sort=False)
         .agg({'active_seconds':'sum', 'url':'first'})
         .reset_index(level='url', drop=True)
         .reset_index()
         .reindex(columns=df.columns))

   ID           url        date  active_seconds
0   1        vk.com  12.01.2016               5
1   1  facebook.com  12.01.2016               7
2   1   twitter.com  12.01.2016              12
3   2        vk.com  12.01.2016               8
4   2   twitter.com  12.01.2016              34
5   3  facebook.com  12.01.2016               5

Note that the ID column above now contains the group numbers from g rather than the original IDs, because g.ID was used as a grouping key. A variant that keeps the original values also groups by df.ID and df.url, but then the columns in the helper dataframe must be renamed to avoid name collisions:

g =  df[['ID','url']].ne(df[['ID','url']].shift()).cumsum()
g.columns = g.columns + '1'
print (g)
   ID1  url1
0    1     1
1    1     2
2    1     2
3    1     3
4    2     4
5    2     5
6    3     6

print (df.groupby([df.ID, df.url, df.date, g.ID1, g.url1], sort=False)['active_seconds']
         .sum()
         .reset_index(level=['ID1','url1'], drop=True)
         .reset_index())

    ID           url        date  active_seconds
0  111        vk.com  12.01.2016               5
1  111  facebook.com  12.01.2016               7
2  111   twitter.com  12.01.2016              12
3  222        vk.com  12.01.2016               8
4  222   twitter.com  12.01.2016              34
5  111  facebook.com  12.01.2016               5

Timings:

The solutions are similar, but pivot_table is slower than groupby:

In [180]: %timeit (df.assign(g=df.ID.ne(df.ID.shift()).cumsum()).pivot_table('active_seconds', ['g', 'ID', 'url', 'date'], None, 'sum').reset_index([1, 2, 3]).reset_index(drop=True))
100 loops, best of 3: 5.02 ms per loop

In [181]: %timeit (df.groupby([df.ID, df.url, df.date, (df.url != df.url.shift()).cumsum().rename('tmp')], sort=False)['active_seconds'].sum().reset_index(level='tmp', drop=True).reset_index())
100 loops, best of 3: 3.62 ms per loop
answered by jezrael


  • (s != s.shift()).cumsum() is a typical way to identify groups of contiguous identifiers
  • pd.DataFrame.assign is a convenient way to add a new column to a copy of a dataframe and chain more methods
  • pivot_table lets us reconfigure the table and aggregate
  • args - a style preference of mine to keep the code cleaner looking; these arguments are passed to pivot_table via *args
  • two reset_index calls to clean up and get to the final result

args = ('active_seconds', ['g', 'ID', 'url', 'date'], None, 'sum')
df.assign(g=df.ID.ne(df.ID.shift()).cumsum()).pivot_table(*args) \
    .reset_index([1, 2, 3]).reset_index(drop=True)

    ID           url        date  active_seconds
0  111  facebook.com  12.01.2016               7
1  111   twitter.com  12.01.2016              12
2  111        vk.com  12.01.2016               5
3  222   twitter.com  12.01.2016              34
4  222        vk.com  12.01.2016               8
5  111  facebook.com  12.01.2016               5

Note that pivot_table sorts the remaining index levels within each g group, which is why the row order differs slightly from the expected output in the question.
answered by piRSquared


It looks like you want a cumsum():

In [195]: df.groupby(['ID', 'url'])['active_seconds'].cumsum()
Out[195]:
0     5
1     4
2     7
3    12
4     8
5    34
6    12
Name: active_seconds, dtype: int64
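
Note that this keeps one row per original row, and because the grouping is by ID and url rather than by consecutive runs, the running total in row 6 spans both facebook.com blocks (4 + 3 + 5 = 12). A possible follow-up sketch (not part of this answer) that produces the collapsed frame the question asks for, reusing the boolean-mask cumsum from the first answer together with a per-run cumsum and tail(1):

# identify runs of consecutive duplicate (ID, url) pairs
g = df[['ID', 'url']].ne(df[['ID', 'url']].shift()).any(axis=1).cumsum()

# running sum within each run, then keep only the last row of each run
out = (df.assign(active_seconds=df.groupby(g)['active_seconds'].cumsum())
         .groupby(g).tail(1)
         .reset_index(drop=True))
print(out)
#     ID           url        date  active_seconds
# 0  111        vk.com  12.01.2016               5
# 1  111  facebook.com  12.01.2016               7
# 2  111   twitter.com  12.01.2016              12
# 3  222        vk.com  12.01.2016               8
# 4  222   twitter.com  12.01.2016              34
# 5  111  facebook.com  12.01.2016               5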
answered by MaxU - stop WAR against UA