
Select first observed data and utilize mutate

Tags: r, dplyr

I am running into an issue with my data: for each individual id, I want to take the first observed score (ordered by ob) and subtract the last observed score from it.

The problem with asking for the first observation minus the last observation is that sometimes the first observation data is missing.

Is there any way to ask for the first observed score for each individual, thus skipping any missing data?

I built the below df to illustrate my problem.

help <- data.frame(id = c(5,5,5,5,5,12,12,12,17,17,20,20,20),
                   ob = c(1,2,3,4,5,1,2,3,1,2,1,2,3),
                   score = c(NA, 2, 3, 4, 3, 7, 3, 4, 3, 4, NA, 1, 4))

   id ob score
1   5  1    NA
2   5  2     2
3   5  3     3
4   5  4     4
5   5  5     3
6  12  1     7
7  12  2     3
8  12  3     4
9  17  1     3
10 17  2     4
11 20  1    NA
12 20  2     1
13 20  3     4

And what I am hoping to run is code that will give me...

   id ob score  es
1   5  1    NA  -1
2   5  2     2  -1
3   5  3     3  -1
4   5  4     4  -1
5   5  5     3  -1
6  12  1     7   3
7  12  2     3   3
8  12  3     4   3
9  17  1     3  -1
10 17  2     4  -1
11 20  1    NA  -3
12 20  2     1  -3
13 20  3     4  -3

I am trying to do this with dplyr, and I understand the use of group_by(), but I am not sure how to select only the first observed score for each id and then use mutate() to create es.

b222 asked Jun 11 '15 17:06



2 Answers

I would use first() and last() (both dplyr functions) and na.omit() (from the default stats package).

First, I would make sure your score column is a numeric column with proper NA values (rather than strings):

help <- data.frame(id = c(5,5,5,5,5,12,12,12,17,17,20,20,20),
       ob = c(1,2,3,4,5,1,2,3,1,2,1,2,3),
       score = c(NA, 2, 3, 4, 3, 7, 3, 4, 3, 4, NA, 1, 4))

Then you can do:

library(dplyr)
# within each id (rows ordered by ob): first non-missing score minus last non-missing score
help %>% group_by(id) %>% arrange(ob) %>% 
    mutate(es = first(na.omit(score)) - last(na.omit(score)))
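To see why na.omit() is the piece that skips the missing first observation, here is a minimal sketch on a plain vector. It assumes dplyr is installed and uses the scores for id 5 from the question; the variable name `es` simply mirrors the column the question asks for.

```r
library(dplyr)

# Scores for id 5 from the question; the first observation is missing
score <- c(NA, 2, 3, 4, 3)

first(score)           # NA: first() on its own returns the leading missing value
first(na.omit(score))  # 2:  na.omit() drops the NA before first() sees the vector
last(na.omit(score))   # 3

# First observed score minus last observed score
es <- first(na.omit(score)) - last(na.omit(score))
es                     # -1, the value expected for id 5
```

Inside mutate() on a grouped data frame, the same expression is evaluated once per id and the single result is recycled to every row of that group.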
MrFlick answered Oct 05 '22 18:10


library(dplyr)

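# one es value per id, computed from the non-missing scores only
temp <- help %>% group_by(id) %>% 
     arrange(ob) %>%
     filter(!is.na(score)) %>% 
     mutate(es = first(score) - last(score)) %>%
     select(id, es) %>%
     distinct()

# attach es to every original row, including the rows where score is NA
help %>% left_join(temp, by = "id")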
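The filter-then-join idea above can also be sketched with summarise(), which collapses each id to one row directly and so replaces the mutate(), select() and distinct() steps. This is an illustrative variant under the assumption that dplyr is installed, not the answer's own code; the names `es_by_id` and `result` are made up for the example.

```r
library(dplyr)

# Example data from the question
help <- data.frame(id = c(5,5,5,5,5,12,12,12,17,17,20,20,20),
                   ob = c(1,2,3,4,5,1,2,3,1,2,1,2,3),
                   score = c(NA, 2, 3, 4, 3, 7, 3, 4, 3, 4, NA, 1, 4))

# One row per id: first non-missing score minus last non-missing score
es_by_id <- help %>%
  arrange(id, ob) %>%          # order observations within each id
  filter(!is.na(score)) %>%    # drop rows with a missing score
  group_by(id) %>%
  summarise(es = first(score) - last(score))

# Join back so every original row, including the NA rows, carries its id's es
result <- help %>% left_join(es_by_id, by = "id")
result
```

Passing by = "id" makes the join key explicit and avoids the message dplyr prints when it has to guess the common columns.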
Yifei answered Oct 05 '22 17:10