I have a data.frame df.l. I want to create a new variable using the output from summarize as the index to retrieve the value from a column in the original data.frame.
df.l has the following columns: trial, location, posi, date, and value.
I want to use the sum of value == 1 for each group (trial, location, date) as an index from which to select the value from posi and store it as a new variable.
value in df.l can be 1 or 0 (once it becomes 0 it stays 0, provided the rows are ordered correctly, i.e. by posi from 0 to 1). This grouped sum indicates where value changes from 1 to 0 within the group.
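As a minimal illustration of that idea for a single group, here is a base R sketch. The vectors are made up for the example (they mirror one group of df.l but are deliberately shuffled), and it assumes that sorting by posi is what "ordered correctly" means here.

```r
# Hypothetical single group, deliberately out of order
posi  <- c(0.65, 0, 1, 0.28)
value <- c(1L, 1L, 0L, 1L)

# Sort both vectors by posi so that all the 1s come before the 0s
ord          <- order(posi)
posi_sorted  <- posi[ord]
value_sorted <- value[ord]

# The count of 1s is the position of the last 1 within the sorted group
idx <- sum(value_sorted == 1)

# The posi value at which value is last equal to 1
last_posi <- posi_sorted[idx]
```

With this toy input, idx is 3 and last_posi is 0.65, the same as the first row of the desired result.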
To determine the index location the following code works:
test <- df.l %>%
group_by(trial, location, date) %>%
summarise(n= sum(value==1))
but of course, posi is missing.
I was hoping that something like the code below would work, but it doesn't. It starts out with correct results, but somewhere the indexing goes awry. I don't know if it makes sense to reference a column the way I did.
test <- df.l %>%
group_by(trial, location, date) %>%
summarise(n= sum(value==1)) %>%
mutate(ANS = nth(df.l$posi,n))
Using dplyr, can I create an "index" from a group to select a value from the original data.frame, and then add this variable to the new data.frame? Or is there another approach using dplyr to achieve the same results?
# truncated data.frame
df.l <- structure(list(trial = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
location = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), posi = c(0,
0.28, 0.65, 1, 0, 0.33, 0.67, 1, 0, 0.2, 0.5, 1, 0, 0.28,
0.65, 1, 0, 0.33, 0.67, 1, 0, 0.2, 0.5, 1), date = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), value = c(1L, 1L, 1L, 0L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L,
1L, 1L, 1L, 0L, 0L)), .Names = c("trial", "location", "posi", "date", "value"), row.names = c(NA, 24L), class = "data.frame")
#desired result
result <- structure(list(trial = c(1L, 1L, 1L, 2L, 2L, 2L), location = c(1L,
2L, 3L, 1L, 2L, 3L), date = c(1L, 1L, 1L, 1L, 1L, 1L), n = c(3L,
4L, 4L, 1L, 4L, 2L), posi = c(0.65, 1, 1, 0, 1, 0.2)), class = "data.frame", .Names = c("trial",
"location", "date", "n", "posi"), row.names = c(NA, -6L))
You can do it inside the summarise:
df.l %>%
group_by(trial, location, date) %>%
summarise(n= sum(value==1), ANS = nth(posi,n))
#Source: local data frame [6 x 5]
#Groups: trial, location
#
#  trial location date n  ANS
#1     1        1    1 3 0.65
#2     1        2    1 4 1.00
#3     1        3    1 4 1.00
#4     2        1    1 1 0.00
#5     2        2    1 4 1.00
#6     2        3    1 2 0.20
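A likely reason the attempt in the question drifts is that df.l$posi refers to the whole column, so each n is used as a position in all 24 rows rather than within its own group. The base R sketch below illustrates the difference on a made-up two-group data frame (toy, g, and the picked values are hypothetical, not taken from df.l).

```r
# Hypothetical toy data: two groups of three rows each
toy <- data.frame(
  g     = c("a", "a", "a", "b", "b", "b"),
  posi  = c(0, 0.5, 1, 0, 0.4, 1),
  value = c(1L, 1L, 0L, 1L, 1L, 0L)
)

# Number of 1s in each group (2 for "a", 2 for "b")
n_by_group <- tapply(toy$value == 1, toy$g, sum)

# Global indexing: n is treated as a row position in the whole column
global_pick <- toy$posi[n_by_group]

# Group-wise indexing: n is a position within each group's own posi values
group_pick <- mapply(function(p, n) p[n], split(toy$posi, toy$g), n_by_group)
```

Here global_pick is 0.5 for both groups (row 2 of the full column each time), while group_pick is 0.5 for "a" and 0.4 for "b". Referring to posi directly inside a grouped summarise corresponds to the group-wise version.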
Or, if you don't actually need the n in the result, you could do
df.l %>%
group_by(trial, location, date) %>%
summarise(ANS = nth(posi, sum(value == 1)))
Or
df.l %>%
group_by(trial, location, date) %>%
summarise(ANS = posi[sum(value == 1)])
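One edge case to keep in mind with plain bracket indexing: if a group contains no 1s at all, the index is 0 and posi[0] has length zero instead of being a single value. The truncated data above has no such group, so this is only a precaution. The sketch below uses made-up vectors and a hypothetical helper, pick_or_na, to show one way to guard against it.

```r
# Hypothetical group in which value is never 1
posi  <- c(0, 0.33, 0.67, 1)
value <- c(0L, 0L, 0L, 0L)

idx <- sum(value == 1)          # no 1s, so the index is 0

# Indexing with 0 yields a zero-length vector, not a single value
empty_pick <- posi[idx]

# Hypothetical helper: return NA when the index is out of range
pick_or_na <- function(x, i) {
  if (i >= 1 && i <= length(x)) x[i] else NA_real_
}

safe_pick <- pick_or_na(posi, idx)
```

Inside a grouped summarise, posi[sum(value == 1)] could be swapped for pick_or_na(posi, sum(value == 1)) if such groups are possible in the full data.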
slice seems like the most natural option here:
df.l %>% group_by(trial,location,date) %>% mutate(n=row_number()) %>% slice(sum(value))
This gives
  trial location posi date value n
1     1        1 0.65    1     1 3
2     1        2 1.00    1     1 4
3     1        3 1.00    1     1 4
4     2        1 0.00    1     1 1
5     2        2 1.00    1     1 4
6     2        3 0.20    1     1 2
The slice function selects one or more rows according to their indices (within a group if applicable), exactly as the OP describes.
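If the output should have the same columns as the desired result in the question, the slice approach can be followed by a select. The sketch below assumes dplyr is installed and rebuilds the truncated df.l compactly with rep so that it runs on its own; the name out is just a placeholder.

```r
library(dplyr)

# Compact reconstruction of the truncated df.l from the question
df.l <- data.frame(
  trial    = rep(1:2, each = 12),
  location = rep(rep(1:3, each = 4), times = 2),
  posi     = rep(c(0, 0.28, 0.65, 1, 0, 0.33, 0.67, 1, 0, 0.2, 0.5, 1), times = 2),
  date     = 1L,
  value    = c(1, 1, 1, 0,  1, 1, 1, 1,  1, 1, 1, 1,
               1, 0, 0, 0,  1, 1, 1, 1,  1, 1, 0, 0)
)

out <- df.l %>%
  group_by(trial, location, date) %>%
  mutate(n = row_number()) %>%            # position of each row within its group
  slice(sum(value)) %>%                   # keep the row at the last 1 in each group
  ungroup() %>%
  select(trial, location, date, n, posi)  # same columns as the desired result
```

As with the other answers, this relies on the rows within each group already being ordered by posi.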