
dplyr and tail to change last value in a group_by in r

Tags: r, dplyr, tail

While using dplyr, I'm having trouble changing the last value in each group of my data frame. I want to group by user_id and tag and set Time to 0 for the last row in each group.

     user_id     tag   Time
1  268096674       1    3
2  268096674       1    10
3  268096674       1    1
4  268096674       1    0
5  268096674       1    9999
6  268096674       2    0
7  268096674       2    9
8  268096674       2    500
9  268096674       3    0
10 268096674       3    1
...

Desired output:

     user_id     tag   Time
1  268096674       1    3
2  268096674       1    10
3  268096674       1    1
4  268096674       1    0
5  268096674       1    0
6  268096674       2    0
7  268096674       2    9
8  268096674       2    0
9  268096674       3    0
10 268096674       3    1
...

I've tried something like this, among other things, and can't figure it out:

df %>%
  group_by(user_id,tag) %>%
  mutate(tail(Time) <- 0)

I also tried adding a row number, but couldn't quite put it all together. Any help would be appreciated.

itjcms18 asked Apr 26 '15 18:04




2 Answers

Here's an option:

library(dplyr)

df %>%
  group_by(user_id, tag) %>%
  mutate(Time = c(Time[-n()], 0))
#Source: local data frame [10 x 3]
#Groups: user_id, tag
#
#     user_id tag Time
#1  268096674   1    3
#2  268096674   1   10
#3  268096674   1    1
#4  268096674   1    0
#5  268096674   1    0
#6  268096674   2    0
#7  268096674   2    9
#8  268096674   2    0
#9  268096674   3    0
#10 268096674   3    0

What I did here: Time[-n()] takes the existing Time column with every element except the last one in the group (the element at index n()), and c() then appends a 0 to that vector as the new last element.
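To see what that expression does within a single group, here is a minimal sketch on a plain vector. The values are taken from the tag 2 group in the question, and length() stands in for n(), which is only available inside dplyr verbs.

```r
# Time values of one group (tag 2 in the question's sample data)
Time <- c(0, 9, 500)

# Outside dplyr, length() plays the role of n()
n <- length(Time)

# Negative indexing drops the element at position n (the last one)
without_last <- Time[-n]         # 0 9

# c() concatenates, so a 0 is appended as the new last element
updated <- c(without_last, 0)    # 0 9 0
print(updated)
```

The result has the same length as the input, which is what mutate() requires for each group.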

Note that in my output the Time value in row 10 is also changed to 0, because within the ten rows shown it is the last entry of its group.
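Since the question mentions trying a row number, here is a hedged sketch of two equivalent ways to write the same update. It assumes the data consists only of the ten rows shown in the question (the real data continues after the "..."), and the names by_row_number and by_replace are made up for illustration.

```r
library(dplyr)

# Sample data reconstructed from the ten rows shown in the question
df <- data.frame(
  user_id = rep(268096674, 10),
  tag     = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  Time    = c(3, 10, 1, 0, 9999, 0, 9, 500, 0, 1)
)

# Variant 1: compare each row's position in its group with the group size
by_row_number <- df %>%
  group_by(user_id, tag) %>%
  mutate(Time = if_else(row_number() == n(), 0, Time)) %>%
  ungroup()

# Variant 2: base R replace() overwrites the element at index n()
by_replace <- df %>%
  group_by(user_id, tag) %>%
  mutate(Time = replace(Time, n(), 0)) %>%
  ungroup()

print(by_row_number)
```

Both variants set Time to 0 in rows 5, 8 and 10 of this sample, matching the output above.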

talat answered Oct 18 '22 22:10


I would like to offer an alternative approach that avoids copying the whole column (which both Time[-n()] and replace do) and modifies the data in place:

library(data.table)
indx <- setDT(df)[, .I[.N], by = .(user_id, tag)]$V1 # finding the last incidences per group
df[indx, Time := 0L] # modifying in place
df
#       user_id tag Time
#  1: 268096674   1    3
#  2: 268096674   1   10
#  3: 268096674   1    1
#  4: 268096674   1    0
#  5: 268096674   1    0
#  6: 268096674   2    0
#  7: 268096674   2    9
#  8: 268096674   2    0
#  9: 268096674   3    0
# 10: 268096674   3    0
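For anyone who wants to run this without the original data, here is a self-contained sketch of the same idea. It assumes only the ten rows shown in the question; the names dt, last_rows and row are illustrative, and the index column is named explicitly instead of relying on the default V1.

```r
library(data.table)

# Sample data reconstructed from the ten rows shown in the question
dt <- data.table(
  user_id = rep(268096674, 10),
  tag     = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  Time    = c(3, 10, 1, 0, 9999, 0, 9, 500, 0, 1)
)

# .I holds row numbers in the full table and .N is the group size,
# so .I[.N] is the row number of the last row in each group
last_rows <- dt[, .(row = .I[.N]), by = .(user_id, tag)]$row

# := assigns by reference, so only the selected rows are updated
dt[last_rows, Time := 0]
print(dt)
```

On this sample, last_rows is 5, 8 and 10, one row number per (user_id, tag) group.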
David Arenburg answered Oct 18 '22 21:10