Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

grouping on string fields in R

I have a data frame like this:

         date      time userid        status
1  02/25/2012  09:22:10   aabc     logged_in
2  02/25/2012  09:30:10   aabc    logged_out
3  02/25/2012  09:29:20   abbc    logged_out
4  02/25/2012  09:27:30    abc     logged_in
5  02/25/2012  09:26:29    abc  login_failed
6  02/25/2012  09:26:39    abc  login_failed
7  02/25/2012  09:26:52    abc  login_failed
8  02/25/2012  09:27:09    abc  login_failed
9  02/25/2012  09:27:20    abc  login_failed
10 02/25/2012  09:24:10   abdc     logged_in
11 02/25/2012  09:24:12   abdc    logged_out
12 02/25/2012  09:22:10   abhc     logged_in
13 02/25/2012  09:30:10   abuc     logged_in
14 02/25/2012  09:30:14   abuc    logged_out
15 02/25/2012  09:29:40    baa     logged_in

I want to the userids, status and "count" of the login_failures for each userid. I did this:

ddply(mytbl, c('userid', 'status'), function(x) c(count=nrow(x))) , but this gives the count for all userids. I want to restrict my output to only those userids whose status is 'login _failed'. Any ideas? I have seen questions on grouping by numeric fields, but, none on strings.

I am not very familiar with all the plyr features. It will great to see how this can be done using summarize, aggregate, sqldf, data.table etc. Slowly understanding each of them.

Thanks Sri

like image 351
user1717931 Avatar asked Dec 20 '22 15:12

user1717931


2 Answers

require(data.table)
DT = as.data.table(mytbl)

DT[status=="login_failed", .N, by=userid]

To name the column :

DT[status=="login_failed", list(failed_logins=.N), by=userid]
like image 103
Matt Dowle Avatar answered Jan 13 '23 12:01

Matt Dowle


Slightly different approach than @Maiasaura. I filter to just the failed logins and then summarize. The difference would be whether those userid's with logins, but no failed logins, appear in the final result with 0's or not.

ddply(mytbl[mytbl$status=="login_failed",], .(userid), 
  summarise, failed_logins=length(status))

This gives

> ddply(mytbl[mytbl$status=="login_failed",], .(userid), 
+   summarise, failed_logins=length(status))
  userid failed_logins
1    abc             5

To complete the approaches, if you want all the userid's:

ddply(mytbl, .(userid), 
  summarise, failed_logins = sum(status=="login_failed"))

which gives

> ddply(mytbl, .(userid), 
+   summarise, failed_logins = sum(status=="login_failed"))
  userid failed_logins
1   aabc             0
2   abbc             0
3    abc             5
4   abdc             0
5   abhc             0
6   abuc             0
7    baa             0
like image 35
Brian Diggs Avatar answered Jan 13 '23 13:01

Brian Diggs