I have a large data frame that looks like so (and is copy-pasteable with df=pd.read_clipboard(sep='\s\s+')):
user_nm month unique_ips shifted_ips halves quarters mo_pairs
100118231 2 set([142.136]) set([]) h1 q1 p1
100118231 3 set([142.136]) set([142.136]) h1 q1 p2
100118231 6 set([108.0]) set([142.136]) h1 q2 p3
100118231 7 set([108.0]) set([108.0]) h2 q3 p4
100118231 8 set([142.136]) set([108.0]) h2 q3 p4
100118231 9 set([142.136]) set([142.136]) h2 q3 p5
100118231 10 set([142.136]) set([142.136]) h2 q4 p5
100118231 11 set([142.136]) set([142.136]) h2 q4 p6
100406016 3 set([50.192]) set([]) h1 q1 p2
100406016 7 set([50.192]) set([50.192]) h2 q3 p4
For each user, I want to group by halves (or quarters, or mo_pairs) and get the union of unique_ips and shifted_ips.
I can groupby the fields like so:
In [265]: a=df.groupby(['user_nm','halves'])
In [266]: a.head()
Out[266]:
user_nm month unique_ips shifted_ips halves quarters mo_pairs
user_nm halves
100118231 h1 0 100118231 2 set([142.136]) set([]) h1 q1 p1
1 100118231 3 set([142.136]) set([142.136]) h1 q1 p2
2 100118231 6 set([108.0]) set([142.136]) h1 q2 p3
h2 3 100118231 7 set([108.0]) set([108.0]) h2 q3 p4
4 100118231 8 set([142.136]) set([108.0]) h2 q3 p4
5 100118231 9 set([142.136]) set([142.136]) h2 q3 p5
6 100118231 10 set([142.136]) set([142.136]) h2 q4 p5
7 100118231 11 set([142.136]) set([142.136]) h2 q4 p6
100406016 h1 8 100406016 3 set([50.192]) set([]) h1 q1 p2
h2 9 100406016 7 set([50.192]) set([50.192]) h2 q3 p4
However, when I attempt to union these rows, I get an error:
In [267]: a.apply(lambda x: x[2] & x[3], axis=1)
TypeError: <lambda>() got an unexpected keyword argument 'axis'
Ideally, I would like something like this:
unique_ips shifted_ips
user_nm halves
100118231 h1 set([142.136, 108.0]) set([142.136])
100118231 h2 set([142.136, 108.0]) set([142.136, 108.0])
100406016 h1 set([50.192]) set([])
100406016 h2 set([50.192]) set([50.192])
I've also tried set_index, but that does not group the dataframe appropriately:
b = df.set_index(['user_nm','halves'])
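For example, just to illustrate what I mean by "not grouped appropriately", here is a rough sketch of the check I did; the rows are only re-indexed, not merged:
# Exact lookup of one (user_nm, halves) pair still returns the individual
# rows (three of them for this user/half); nothing has been unioned:
b.loc[(100118231, 'h1'), ['unique_ips', 'shifted_ips']]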
This seems like a relatively simple task, what am I missing?
The short answer is that you need to use the aggregate method when reducing a groupby object (see Pandas GroupBy Aggregate in the docs).
The following snippet should solve your problem.
Properly handling the sets while reading: the elements come out as str and not set, so convert them first:
df.unique_ips = df.unique_ips.apply(eval)
df.shifted_ips = df.shifted_ips.apply(eval)
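(If you would rather not call eval on raw strings, a small parser does the same conversion. This is just a sketch that assumes every cell looks like set([...]) with dotted numbers inside; parse_ip_set is a hypothetical helper, not part of pandas.)
import re

def parse_ip_set(cell):
    # Turn strings like "set([142.136])" or "set([])" back into real set
    # objects without eval; the values stay floats, matching what eval gives.
    return set(float(v) for v in re.findall(r'\d+\.\d+', cell))

df.unique_ips = df.unique_ips.apply(parse_ip_set)
df.shifted_ips = df.shifted_ips.apply(parse_ip_set)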
Doing the groupby:
from functools import reduce  # needed on Python 3; reduce is a builtin on Python 2

grouped = df.groupby(['user_nm', 'halves'])
my_lambda = lambda x: reduce(set.union, x)
output = grouped.aggregate({'unique_ips': my_lambda,
                            'shifted_ips': my_lambda})
The result being:
unique_ips shifted_ips
user_nm halves
100118231 h1 set([142.136, 108.0]) set([142.136])
h2 set([142.136, 108.0]) set([142.136, 108.0])
100406016 h1 set([50.192]) set([])
h2 set([50.192]) set([50.192])
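The same pattern covers the other groupings you mentioned; only the key list changes. A sketch for quarters:
by_quarters = df.groupby(['user_nm', 'quarters']).aggregate(
    {'unique_ips': my_lambda, 'shifted_ips': my_lambda})
# Swap 'quarters' for 'mo_pairs' to get the month-pair version.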