This is an update to the structure of my DataFrame. I formulated the original structure in haste: I was inspecting a single user and mocked up the structure from that. @liliscent's remark that the "data accidentally satisfies this condition" is also true, and value_counts plus cumsum() solves that case. But user_ids also change, and different users can have the same meet_id if they have the same text.
Updated DataFrame structure:
import pandas as pd

mytable = pd.DataFrame({
    'user_id': ['3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c',
                '3d', '3d', '3d', '3d', '3e', '3e', '3r', '3w', '3w', '3w', '3w'],
    'meet_id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 5, 6, 1, 2, 1, 1],
    'text': ['abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz',
             'npq', 'npq', 'npq', 'npq', 'tt', 'op', 'li', 'abc', 'xyz', 'abc', 'abc'],
    'label': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'A',
              'G', 'H', 'H', 'H', 'A', 'A', 'B', 'E', 'G', 'B', 'B']})
mytable = mytable[['user_id', 'meet_id', 'text', 'label']]  # ordering the columns the way I want them printed
user_id meet_id text label
3c 1 abc A
3c 1 abc A
3c 1 abc A
3c 1 abc A
3c 1 abc A
3c 1 abc B
3c 1 abc B
3c 2 xyz B
3c 2 xyz B
3c 2 xyz B
3c 2 xyz C
3c 2 xyz C
3c 2 xyz A
3d 3 npq G
3d 3 npq H
3d 3 npq H
3d 3 npq H
3e 4 tt A
3e 5 op A
3r 6 li B
3w 1 abc E
3w 2 xyz G
3w 1 abc B
3w 1 abc B
I would like to group by the [user_id, meet_id] columns and concatenate the label column so that, within each group, the most frequent label is left untouched, the second most frequent label gets the first label concatenated to it, and the least frequent label gets all of the group's labels concatenated.
The updated DataFrame below is the output I am looking for:
mytable_pro = pd.DataFrame({
    'user_id': ['3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c', '3c',
                '3d', '3d', '3d', '3d', '3e', '3e', '3r', '3w', '3w', '3w', '3w'],
    'meet_id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 5, 6, 1, 2, 1, 1],
    'text': ['abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz',
             'npq', 'npq', 'npq', 'npq', 'tt', 'op', 'li', 'abc', 'xyz', 'abc', 'abc'],
    'label': ['A', 'A', 'A', 'A', 'A', 'B, A', 'B, A', 'B', 'B', 'B', 'B, C', 'B, C', 'A, B, C',
              'H, G', 'H', 'H', 'H', 'A', 'A', 'B', 'E, B', 'G', 'B', 'B']})
mytable_pro = mytable_pro[['user_id', 'meet_id', 'text', 'label']]  # ordering the columns the way I want them printed
This gives:
user_id meet_id text label
3c 1 abc A
3c 1 abc A
3c 1 abc A
3c 1 abc A
3c 1 abc A
3c 1 abc B, A
3c 1 abc B, A
3c 2 xyz B
3c 2 xyz B
3c 2 xyz B
3c 2 xyz B, C
3c 2 xyz B, C
3c 2 xyz A, B, C
3d 3 npq H, G
3d 3 npq H
3d 3 npq H
3d 3 npq H
3e 4 tt A
3e 5 op A
3r 6 li B
3w 1 abc E, B
3w 2 xyz G
3w 1 abc B
3w 1 abc B
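To restate the rule in plain terms (my wording; the helper below is hypothetical and only illustrates the rule): within each (user_id, meet_id) group, rank the distinct labels by descending frequency; a row whose label has rank k gets the comma-join of the k highest-ranked labels. The mocked output above mixes the order inside the joins, and as noted below, the order among ties does not matter. A minimal pure-Python sketch:

from collections import Counter

def expected_labels(labels):
    # Hypothetical helper: rank distinct labels by descending frequency, then
    # give each row the join of all labels ranked at or above its own label.
    order = [lab for lab, _ in Counter(labels).most_common()]
    rank = {lab: i for i, lab in enumerate(order)}
    return [', '.join(order[:rank[lab] + 1]) for lab in labels]

print(expected_labels(['A', 'A', 'A', 'A', 'A', 'B', 'B']))
# ['A', 'A', 'A', 'A', 'A', 'A, B', 'A, B']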
The answer given by @piRSquared:
mytable.groupby('meet_id').label.value_counts().groupby('meet_id').apply(
    lambda d: d.index.to_series().str[1].cumsum().str.join(', '))
is the CORRECT ANSWER for the WRONG question I asked; thanks a ton, and really sorry. It solves the ordering problem mentioned previously, but it does not work when a different user has the same meet_id. Just to be exhaustive: if the label frequencies turn out to be equal within a group, it does not matter which label gets the other concatenated.
It gives:
user_id meet_id text label
3c 1 abc A
3c 1 abc A
3c 1 abc A
3c 1 abc A
3c 1 abc A
3c 1 abc A, B
3c 1 abc A, B
3c 2 xyz B
3c 2 xyz B
3c 2 xyz B
3c 2 xyz B, C
3c 2 xyz B, C
3c 2 xyz B, C, A
3d 3 npq H, G
3d 3 npq H
3d 3 npq H
3d 3 npq H
3e 4 tt A
3e 5 op A
3r 6 li B
3w 1 abc A, B, E
3w 2 xyz B, C, A, G
3w 1 abc A, B
3w 1 abc A, B
The labels for 3w are off, since the labels for each meet_id are picked up ignoring the difference in user_id. My bad!
Now, since user_id must also be considered, I tried the following:
s = mytable.groupby(['user_id', 'meet_id']).label.value_counts().groupby(['user_id', 'meet_id']).apply(
    lambda d: d.index.to_series().str[1].cumsum().str.join(', '))
This throws:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
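My best guess at the cause (my own reading, not stated in the thread): after grouping by both keys, the value_counts index tuples are (user_id, meet_id, label), so .str[1] picks out meet_id, an integer; cumsum then produces numbers, which the later .str accessor rejects. A small check along those lines:

counts = mytable.groupby(['user_id', 'meet_id']).label.value_counts()
idx = counts.index.to_series()
print(idx.str[1].head())   # element 1 of (user_id, meet_id, label) is meet_id, an int
print(idx.str[-1].head())  # element -1 is the label string, which is what we want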
Ah! Another small update: in reality I have words in my label column.
dummy_boo = pd.DataFrame({'user_id': ['3g', '3g', '3g'], 'meet_id': [9, 9, 9],
                          'text': ['baby', 'baby', 'baby'], 'label': ['hello', 'hello', 'why']})
Output:
user_id meet_id text label
3g 9 baby hello
3g 9 baby hello
3g 9 baby why
Applying the above code results in each character being separated by a comma, because cumsum() concatenates the label strings into one string and .str.join(', ') then iterates over that string character by character.
user_id meet_id text label
3g 9 baby h, e, l, l, o
3g 9 baby h, e, l, l, o
3g 9 baby h, e, l, l, o, w, h, y
Instead I need:
user_id meet_id text label
3g 9 baby hello
3g 9 baby hello
3g 9 baby hello, why
The dtype for label is object. Should we use astype instead? A big thank you to everyone for helping me out.
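On the astype question, my own reading rather than anything from the thread: the dtype is unlikely to be the problem. The labels are already Python strings under the object dtype; the character-level output comes from the string concatenation itself, as a quick sketch shows:

s = pd.Series(['hello', 'why']).cumsum()  # ['hello', 'hellowhy']
print(s.str.join(', '))                   # 'h, e, l, l, o' and 'h, e, l, l, o, w, h, y'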
value_counts and cumsum

value_counts sorts by descending count.
cols = ['meet_id', 'user_id']

# value_counts puts each group's most frequent label first;
# cumsum then accumulates the (single-character) labels in that order.
s = mytable.groupby(cols).label.value_counts().groupby(cols).apply(
    lambda d: d.index.to_series().str[-1].cumsum().str.join(', ')
)

# Look up each row's (meet_id, user_id, label) triple to get its accumulated labels.
mytable.assign(label=[s.get((a, b, c)) for a, b, c in mytable[cols + ['label']].values])
user_id meet_id text label
0 3c 1 abc A
1 3c 1 abc A
2 3c 1 abc A
3 3c 1 abc A
4 3c 1 abc A
5 3c 1 abc A, B
6 3c 1 abc A, B
7 3c 2 xyz B
8 3c 2 xyz B
9 3c 2 xyz B
10 3c 2 xyz B, C
11 3c 2 xyz B, C
12 3c 2 xyz B, C, A
13 3d 3 npq H, G
14 3d 3 npq H
15 3d 3 npq H
16 3d 3 npq H
17 3e 4 tt A
18 3e 5 op A
19 3r 6 li B
20 3w 1 abc B, E
21 3w 2 xyz G
22 3w 1 abc B
23 3w 1 abc B
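For anyone following along, a quick way to see the descending-count ordering the answer relies on (my own check, not part of the original answer):

# value_counts orders each group's labels by descending count,
# e.g. for (meet_id=1, user_id='3c'): A appears 5 times, B twice.
print(mytable.groupby(['meet_id', 'user_id']).label.value_counts().head())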
sorted as well

cols = ['meet_id', 'user_id']
s = mytable.groupby(cols).label.value_counts().groupby(cols).apply(
    lambda d: d.index.to_series().str[-1].cumsum().apply(sorted).str.join(', ')
)
mytable.assign(label=[s.get((a, b, c)) for a, b, c in mytable[cols + ['label']].values])
user_id meet_id text label
0 3c 1 abc A
1 3c 1 abc A
2 3c 1 abc A
3 3c 1 abc A
4 3c 1 abc A
5 3c 1 abc A, B
6 3c 1 abc A, B
7 3c 2 xyz B
8 3c 2 xyz B
9 3c 2 xyz B
10 3c 2 xyz B, C
11 3c 2 xyz B, C
12 3c 2 xyz A, B, C
13 3d 3 npq G, H
14 3d 3 npq H
15 3d 3 npq H
16 3d 3 npq H
17 3e 4 tt A
18 3e 5 op A
19 3r 6 li B
20 3w 1 abc B, E
21 3w 2 xyz G
22 3w 1 abc B
23 3w 1 abc B
And to adjust for words rather than single characters
cols = ['meet_id', 'user_id']
s = mytable.groupby(cols).label.value_counts().groupby(cols).apply(
    # Append a '|' delimiter before cumsum so whole words can be recovered
    # with split('|') instead of being joined character by character.
    lambda d: d.index.to_series().str[-1].add('|').cumsum().apply(
        lambda e: ', '.join(sorted(e.strip('|').split('|')))
    )
)
mytable.assign(label=[s.get((a, b, c)) for a, b, c in mytable[cols + ['label']].values])
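Applying the same pattern to the dummy_boo example should give the word-level output asked for in the update (a sketch mirroring the answer's code, with the same cols):

cols = ['meet_id', 'user_id']
s = dummy_boo.groupby(cols).label.value_counts().groupby(cols).apply(
    lambda d: d.index.to_series().str[-1].add('|').cumsum().apply(
        lambda e: ', '.join(sorted(e.strip('|').split('|')))
    )
)
dummy_boo.assign(label=[s.get((a, b, c)) for a, b, c in dummy_boo[cols + ['label']].values])
# Expected labels: 'hello', 'hello', 'hello, why'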
With transform and a custom cumulative unique function
from collections import Counter

def cum_unique(x):
    # Wrap each label in a Counter; summing Counters accumulates the labels
    # seen so far, and joining a Counter's keys yields the cumulative string.
    return pd.Series(list(map(
        Counter, x
    ))).cumsum().str.join(', ')

mytable.assign(label=mytable.groupby('meet_id').label.transform(cum_unique))
user_id meet_id text label
0 3c 1 abc A
1 3c 1 abc A
2 3c 1 abc A
3 3c 1 abc A
4 3c 1 abc A
5 3c 1 abc A, B
6 3c 1 abc A, B
7 3c 2 xyz B
8 3c 2 xyz B
9 3c 2 xyz B
10 3c 2 xyz B, C
11 3c 2 xyz B, C
12 3c 2 xyz B, C, A
Shortened version
mytable.assign(label=mytable.groupby('meet_id').label.transform(
    lambda x: pd.Series(list(map(Counter, x))).cumsum().str.join(', ')
))
Per comment by liliscent, we can sort first by meet_id and group size:
sizes = mytable.groupby(['meet_id', 'label']).label.transform('size')
m1 = mytable.assign(sizes=sizes).sort_values(
    ['meet_id', 'sizes'], ascending=[True, False]).drop(columns='sizes')
m1
m1.assign(label=m1.groupby('meet_id').label.transform(
lambda x: pd.Series(list(map(Counter, x))).cumsum().str.join(', ')
)).reindex(mytable.index)
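Why the sort matters, as a gloss on liliscent's comment rather than anything stated outright: cum_unique accumulates labels in row order, so the result only matches the frequency order if each group's rows already come most-frequent-first. Sorting by per-(meet_id, label) group size descending arranges that, and .reindex(mytable.index) restores the original row order afterwards. A quick look at the size column driving the sort:

# For meet_id 1 the per-label sizes are A -> 5, B -> 4 (two from 3c, two from 3w), E -> 1.
sizes = mytable.groupby(['meet_id', 'label']).label.transform('size')
print(mytable.assign(sizes=sizes).head(7))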