Applying an aggregate function over multiple different slices

Question

I have a data array that contains some information about people and projects as such:

person_id | project_id | action | time
--------------------------------------
        1 |          1 |      w |    1
        1 |          2 |      w |    2
        1 |          3 |      w |    2
        1 |          3 |      r |    3
        1 |          3 |      w |    4
        1 |          4 |      w |    4
        2 |          2 |      r |    2
        2 |          2 |      w |    3

I'd like to augment this data with two more fields, "first_time" and "first_time_project", which identify, respectively, the first time any action by that person was seen and the first time that person saw any action on the project. In the end, the data should look like this:

person_id | project_id | action | time | first_time | first_time_project
------------------------------------------------------------------------
        1 |          1 |      w |    1 |          1 |                  1
        1 |          2 |      w |    2 |          1 |                  2
        1 |          3 |      w |    2 |          1 |                  2
        1 |          3 |      r |    3 |          1 |                  2
        1 |          3 |      w |    4 |          1 |                  2
        1 |          4 |      w |    4 |          1 |                  4
        2 |          2 |      r |    2 |          2 |                  2
        2 |          2 |      w |    3 |          2 |                  2

My naive way of doing this is to write a couple of loops:

for (pid in unique(data$person_id)) {
    # earliest time this person was seen at all
    data[data$person_id==pid, "first_time"] = min(data[data$person_id==pid, "time"])
    for (projid in unique(data[data$person_id==pid, "project_id"])) {
        # earliest time this person was seen on this project
        data[data$person_id==pid & data$project_id==projid, "first_time_project"] = min(data[data$person_id==pid & data$project_id==projid, "time"])
    }
}

Now, it doesn't take a genius to see that this is going to be glacially slow with the doubly nested loops. However, I can't figure out a way to handle this in R. I'm kinda emulating the group by option for SQL. I know that by might be able to help, but I can't figure out how to do multiple slices.

Any hints on how to take my code from glacially slow to something a bit faster? I'd be happy with a snail right now.

G. Grothendieck · Accepted Answer

Try ave:

transform(data, 
   first_time = ave(time, person_id, FUN = min),
   first_time_project = ave(time, person_id, project_id, drop = TRUE, FUN = min)
)
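
To check this against the sample data in the question, here is a minimal sketch in base R. It rebuilds the table by hand and computes both columns with ave. The names grp and result are only for illustration. As far as I can tell, the drop = TRUE in the call above is passed through to interaction(), so this sketch builds that person/project grouping explicitly instead, which should behave the same way.

```r
# Rebuild the sample data from the question as a data frame.
data <- data.frame(
  person_id  = c(1, 1, 1, 1, 1, 1, 2, 2),
  project_id = c(1, 2, 3, 3, 3, 4, 2, 2),
  action     = c("w", "w", "w", "r", "w", "w", "r", "w"),
  time       = c(1, 2, 2, 3, 4, 4, 2, 3)
)

# Explicit person/project grouping (illustrative name). drop = TRUE
# discards combinations that never occur (e.g. person 2 on project 1),
# so min() is never called on an empty group.
grp <- interaction(data$person_id, data$project_id, drop = TRUE)

# ave() returns a vector as long as its input, with each element
# replaced by FUN applied to that element's group.
result <- transform(data,
  first_time         = ave(time, person_id, FUN = min),
  first_time_project = ave(time, grp, FUN = min)
)

print(result)
```

The two new columns should match the desired table in the question. Because ave is called once per slice rather than once per person/project pair, the explicit loops go away entirely.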

Tags: r, aggregate, data.table

Pridkett
