I have a pandas DataFrame with log data:
        host service
0   this.com    mail
1   this.com    mail
2   this.com     web
3   that.com    mail
4  other.net    mail
5  other.net     web
6  other.net     web
and I want to find, for each host, the service that generates the most errors:
        host service  no
0   this.com    mail   2
1   that.com    mail   1
2  other.net     web   2
The only solution I found was grouping by host and service, and then iterating over level 0 of the resulting index.
Can anyone suggest a better, shorter version, without the iteration?
import numpy as np
import pandas as pd

# count rows per (host, service) pair
df = df_logfile.groupby(['host','service']).agg({'service': np.size})
df_count = pd.DataFrame()
df_count['host'] = df_logfile['host'].unique()
df_count['service'] = np.nan
df_count['no'] = np.nan
# iterate over each host (level 0 of the MultiIndex)
for h, data in df.groupby(level=0):
    i = data.idxmax()[0]          # (host, service) index of the largest count
    service = i[1]
    no = data.xs(i)[0]
    df_count.loc[df_count['host'] == h, 'service'] = service
    df_count.loc[(df_count['host'] == h) & (df_count['service'] == service), 'no'] = no
Full code: https://gist.github.com/bjelline/d8066de66e305887b714
Given df, the next step is to group by host alone and aggregate with idxmax. This yields, for each host, the index of the row with the largest count. You can then use df.loc[...] to select those rows from df:
import numpy as np
import pandas as pd
df_logfile = pd.DataFrame({
    'host': ['this.com', 'this.com', 'this.com', 'that.com', 'other.net',
             'other.net', 'other.net'],
    'service': ['mail', 'mail', 'web', 'mail', 'mail', 'web', 'web']})
# count rows per (host, service) pair; size() avoids the deprecated dict form of agg
df = df_logfile.groupby(['host','service']).size().to_frame('no')
mask = df.groupby(level=0).agg('idxmax')
df_count = df.loc[mask['no']]
df_count = df_count.reset_index()
print("\nOutput\n{}".format(df_count))
yields the DataFrame
        host service  no
0  other.net     web   2
1   that.com    mail   1
2   this.com    mail   2
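For reference, the same result can be obtained without idxmax at all, by sorting the counts and keeping the first (largest) row per host. This is a sketch of that alternative, built on the same sample data:

```python
import pandas as pd

df_logfile = pd.DataFrame({
    'host': ['this.com', 'this.com', 'this.com', 'that.com', 'other.net',
             'other.net', 'other.net'],
    'service': ['mail', 'mail', 'web', 'mail', 'mail', 'web', 'web']})

# Count each (host, service) pair, then keep the row with the highest
# count within each host: sort by count descending and drop duplicate hosts.
df_count = (df_logfile.groupby(['host', 'service']).size()
            .reset_index(name='no')
            .sort_values('no', ascending=False)
            .drop_duplicates('host')
            .reset_index(drop=True))
print(df_count)
```

drop_duplicates keeps the first occurrence of each host, which after the descending sort is the service with the most entries; ties between services on the same host are broken arbitrarily.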