
Groupby on pandas dataframe and concatenate strings with comma based on the frequency of values in a column

This is an update to the structure of my DataFrame. I formulated the original structure in haste while inspecting a single user, so @liliscent's remark that the "data accidentally satisfies this condition" is true, and value_counts with cumsum() solves that case. But user_ids also change, and different users can have the same meet_id if they have the same text.

Updated DataFrame structure:

    import pandas as pd

    mytable = pd.DataFrame({
        'user_id': ['3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c',
                    '3d', '3d', '3d', '3d', '3e', '3e', '3r', '3w', '3w', '3w', '3w'],
        'meet_id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 5, 6, 1, 2, 1, 1],
        'text': ['abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz',
                 'npq', 'npq', 'npq', 'npq', 'tt', 'op', 'li', 'abc', 'xyz', 'abc', 'abc'],
        'label': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'A',
                  'G', 'H', 'H', 'H', 'A', 'A', 'B', 'E', 'G', 'B', 'B']})
    mytable = mytable[['user_id', 'meet_id', 'text', 'label']]  # ordering columns the way I would like them printed out

   user_id  meet_id  text label
   3c        1      abc     A
   3c        1      abc     A
   3c        1      abc     A
   3c        1      abc     A
   3c        1      abc     A
   3c        1      abc     B
   3c        1      abc     B
   3c        2      xyz     B
   3c        2      xyz     B
   3c        2      xyz     B
   3c        2      xyz     C
   3c        2      xyz     C
   3c        2      xyz     A
   3d        3      npq     G
   3d        3      npq     H
   3d        3      npq     H
   3d        3      npq     H
   3e        4      tt      A
   3e        5      op      A
   3r        6      li      B
   3w        1      abc     E
   3w        2      xyz     G 
   3w        1      abc     B
   3w        1      abc     B

I would like to group by the [user_id, meet_id] columns and concatenate the label column in such a way that the most frequent label within each group is left untouched, the second most frequent label has the most frequent one concatenated onto it, and the least frequent label has all of the group's labels concatenated.

The updated DataFrame below is the output I am looking for:

    mytable_pro = pd.DataFrame({
        'user_id': ['3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c',
                    '3d', '3d', '3d', '3d', '3e', '3e', '3r', '3w', '3w', '3w', '3w'],
        'meet_id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 5, 6, 1, 2, 1, 1],
        'text': ['abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz',
                 'npq', 'npq', 'npq', 'npq', 'tt', 'op', 'li', 'abc', 'xyz', 'abc', 'abc'],
        'label': ['A', 'A', 'A', 'A', 'A', 'B,A', 'B,A', 'B', 'B', 'B', 'B, C', 'B, C', 'A,B,C',
                  'H,G', 'H', 'H', 'H', 'A', 'A', 'B', 'E,B', 'G', 'B', 'B']})
    mytable_pro = mytable_pro[['user_id', 'meet_id', 'text', 'label']]  # ordering columns the way I would like them printed out

This gives:

    user_id  meet_id text  label
   3c        1       abc      A
   3c        1       abc      A
   3c        1       abc      A
   3c        1       abc      A
   3c        1       abc      A
   3c        1       abc     B,A
   3c        1       abc     B,A
   3c        2       xyz      B
   3c        2       xyz      B
   3c        2       xyz      B
   3c        2       xyz    B, C
   3c        2       xyz    B, C
   3c        2       xyz    A,B,C
   3d        3       npq     H,G
   3d        3       npq      H
   3d        3       npq      H
   3d        3       npq      H
   3e        4       tt       A
   3e        5       op       A
   3r        6       li       B
   3w        1       abc     E,B
   3w        2       xyz      G
   3w        1       abc      B
   3w        1       abc      B

The answer given by @piRSquared:

    mytable.groupby('meet_id').label.value_counts().groupby('meet_id').apply(
        lambda d: d.index.to_series().str[1].cumsum().str.join(', '))

is the CORRECT ANSWER to the WRONG question I asked, thanks a ton and really sorry. It solves the ordering problem as mentioned previously, but it does not work when a different user has the same meet_id. Just to be exhaustive: if two labels in a group have equal frequency, it does not matter which one gets the other concatenated.

It gives:

     user_id  meet_id  text       label
   3c          1         abc           A
   3c          1         abc           A
   3c          1         abc           A
   3c          1         abc           A
   3c          1         abc           A
   3c          1         abc        A, B
   3c          1         abc        A, B
   3c          2         xyz           B
   3c          2         xyz           B
   3c          2         xyz           B
   3c          2         xyz        B, C
   3c          2         xyz        B, C
   3c          2         xyz     B, C, A
   3d          3         npq        H, G
   3d          3         npq           H
   3d          3         npq           H
   3d          3         npq           H
   3e          4          tt           A
   3e          5          op           A
   3r          6          li           B
   3w          1         abc     A, B, E
   3w          2         xyz    B, C, A, G
   3w          1         abc        A, B
   3w          1         abc        A, B

The labels for 3w are off because the labels for each meet_id are picked up ignoring the difference in user_id. My bad!

Now, since user_id must also be considered, I tried the following:

    s = mytable.groupby(['user_id', 'meet_id']).label.value_counts().groupby(['user_id', 'meet_id']).apply(
        lambda d: d.index.to_series().str[1].cumsum().str.join(', '))

This throws:

    AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

Ah! Another small update: in reality I have words in my label column.

    dummy_boo = pd.DataFrame({'user_id': ['3g', '3g', '3g'], 'meet_id': [9, 9, 9], 'text': ['baby', 'baby', 'baby'], 'label': ['hello', 'hello', 'why']})

Output:

    user_id  meet_id  text  label
      3g        9     baby  hello
      3g        9     baby  hello
      3g        9     baby   why

Applying the above code results in each character being separated by a comma, since .str.join(', ') treats each concatenated string as a sequence of characters:

    user_id  meet_id  text   label
    3g       9        baby   h, e, l, l, o
    3g       9        baby   h, e, l, l, o
    3g       9        baby   h, e, l, l, o, w, h, y

Instead I need:

    user_id  meet_id  text   label
    3g       9        baby   hello
    3g       9        baby   hello
    3g       9        baby   hello, why

The dtype for label is object. Should we use astype instead? A big thank you to everyone for helping me out.

Shiva Kumar asked May 09 '18




1 Answer

value_counts and cumsum

value_counts sorts by descending count, so within each group the most frequent label comes first.

cols = ['meet_id', 'user_id']

# count labels per (meet_id, user_id) group, most frequent first,
# then cumulatively concatenate the label level of the resulting index
s = mytable.groupby(cols).label.value_counts().groupby(cols).apply(
    lambda d: d.index.to_series().str[-1].cumsum().str.join(', ')
)

# map each original row to its cumulative label via a (meet_id, user_id, label) lookup
mytable.assign(label=[s.get((a, b, c)) for a, b, c in mytable[cols + ['label']].values])

   user_id  meet_id text    label
0       3c        1  abc        A
1       3c        1  abc        A
2       3c        1  abc        A
3       3c        1  abc        A
4       3c        1  abc        A
5       3c        1  abc     A, B
6       3c        1  abc     A, B
7       3c        2  xyz        B
8       3c        2  xyz        B
9       3c        2  xyz        B
10      3c        2  xyz     B, C
11      3c        2  xyz     B, C
12      3c        2  xyz  B, C, A
13      3d        3  npq     H, G
14      3d        3  npq        H
15      3d        3  npq        H
16      3d        3  npq        H
17      3e        4   tt        A
18      3e        5   op        A
19      3r        6   li        B
20      3w        1  abc     B, E
21      3w        2  xyz        G
22      3w        1  abc        B
23      3w        1  abc        B
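
A quick sanity check on the lookup in the last line (a minimal sketch, assuming the mytable and s defined above): s ends up indexed by (meet_id, user_id, label), which is why each row can be mapped to its cumulative label with a 3-tuple.

# s is keyed by (meet_id, user_id, label); the values are the cumulative labels
print(s.get((1, '3c', 'B')))   # expected 'A, B'     (B is the second most frequent label for user 3c, meet 1)
print(s.get((2, '3c', 'A')))   # expected 'B, C, A'  (A is the least frequent label for user 3c, meet 2)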

Include sorted as well

cols = ['meet_id', 'user_id']
s = mytable.groupby(cols).label.value_counts().groupby(cols).apply(
    lambda d: d.index.to_series().str[-1].cumsum().apply(sorted).str.join(', ')
)

mytable.assign(label=[s.get((a, b, c)) for a, b, c in mytable[cols + ['label']].values])

   user_id  meet_id text    label
0       3c        1  abc        A
1       3c        1  abc        A
2       3c        1  abc        A
3       3c        1  abc        A
4       3c        1  abc        A
5       3c        1  abc     A, B
6       3c        1  abc     A, B
7       3c        2  xyz        B
8       3c        2  xyz        B
9       3c        2  xyz        B
10      3c        2  xyz     B, C
11      3c        2  xyz     B, C
12      3c        2  xyz  A, B, C
13      3d        3  npq     G, H
14      3d        3  npq        H
15      3d        3  npq        H
16      3d        3  npq        H
17      3e        4   tt        A
18      3e        5   op        A
19      3r        6   li        B
20      3w        1  abc     B, E
21      3w        2  xyz        G
22      3w        1  abc        B
23      3w        1  abc        B

And to adjust for words rather than single characters

cols = ['meet_id', 'user_id']
# append a '|' delimiter to each label before the cumulative sum so whole
# words can be recovered by splitting, then sort and rejoin with ', '
s = mytable.groupby(cols).label.value_counts().groupby(cols).apply(
    lambda d: d.index.to_series().str[-1].add('|').cumsum().apply(
        lambda e: ', '.join(sorted(e.strip('|').split('|')))
    )
)

mytable.assign(label=[s.get((a, b, c)) for a, b, c in mytable[cols + ['label']].values])
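
Applied to the word-label example from the question, this keeps whole words intact. A minimal sketch, assuming pandas is imported as pd and rebuilding dummy_boo as defined in the question:

# rebuild the word-label example from the question
dummy_boo = pd.DataFrame({'user_id': ['3g', '3g', '3g'], 'meet_id': [9, 9, 9],
                          'text': ['baby', 'baby', 'baby'], 'label': ['hello', 'hello', 'why']})

cols = ['meet_id', 'user_id']
s2 = dummy_boo.groupby(cols).label.value_counts().groupby(cols).apply(
    lambda d: d.index.to_series().str[-1].add('|').cumsum().apply(
        lambda e: ', '.join(sorted(e.strip('|').split('|')))
    )
)

dummy_boo.assign(label=[s2.get((a, b, c)) for a, b, c in dummy_boo[cols + ['label']].values])
# expected labels: 'hello', 'hello', 'hello, why'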

Old Answer

With transform and a custom cumulative unique function

from collections import Counter

def cum_unique(x):
    # turn each label into a Counter, cumulatively merge the Counters,
    # then join the accumulated keys with ', '
    return pd.Series(list(map(
        Counter, x
    ))).cumsum().str.join(', ')

mytable.assign(label=mytable.groupby('meet_id').label.transform(cum_unique))

   user_id  meet_id text    label
0       3c        1  abc        A
1       3c        1  abc        A
2       3c        1  abc        A
3       3c        1  abc        A
4       3c        1  abc        A
5       3c        1  abc     A, B
6       3c        1  abc     A, B
7       3c        2  xyz        B
8       3c        2  xyz        B
9       3c        2  xyz        B
10      3c        2  xyz     B, C
11      3c        2  xyz     B, C
12      3c        2  xyz  B, C, A

Shortened version

mytable.assign(label=mytable.groupby('meet_id').label.transform(
    lambda x: pd.Series(list(map(Counter, x))).cumsum().str.join(', ')
))
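
A side note on why this Old Answer only suits single-character labels (a small illustration, not tied to the DataFrame above): Counter over a string counts its characters, so a word label would be split into letters.

from collections import Counter

print(Counter('A'))       # Counter({'A': 1})                           behaves like a one-label "set"
print(Counter('hello'))   # Counter({'l': 2, 'h': 1, 'e': 1, 'o': 1})   splits the word into characters
# wrapping each label in a one-element list, e.g. Counter([label]),
# would keep word labels such as 'hello' intact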

Per comment by liliscent

We can sort first by meet_id and group size

sizes = mytable.groupby(['meet_id', 'label']).label.transform('size')

m1 = mytable.assign(sizes=sizes).sort_values(
    ['meet_id', 'sizes'], ascending=[True, False]).drop('sizes', 1)
m1

m1.assign(label=m1.groupby('meet_id').label.transform(
    lambda x: pd.Series(list(map(Counter, x))).cumsum().str.join(', ')
)).reindex(mytable.index)
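
To see why the pre-sort matters (a quick check, assuming the mytable from the question and the m1 computed above): within each meet_id the most frequent label now comes first, so the cumulative Counter accumulates labels in frequency order before reindex restores the original row order.

# for meet_id 1 the five 'A' rows (count 5) precede the 'B' rows (count 4)
# and the single 'E' row (count 1)
print(m1.loc[m1.meet_id == 1, ['user_id', 'label']])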
piRSquared answered Nov 08 '22