Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R - how to fill NA's between two corresponding ID's in a dataframe

Tags:

r

data-munging

I am trying to take the the following dataset and convert it to the second. Basically, I'm trying to fill in the NA's between each ID with that ID.

Each ID corresponds to two time stamps, which I've joined in to a larger date_time column. For purposes of reproduce-ability, it's too computationally expensive to do a sql between join (the date_time column is very large) or even to take the original dataset and create the time stamps between each id and then join it in (I have too many ID's to do this). I've successfully completed both those methods and it just takes too much time for the amount of data I have. I'm hoping to manipulate the data with this data set. It seems like such a simple thing, but it's really got me stumped. Any help would be appreciated.

Current data set:

             date_time     id
                <dttm>  <dbl>
 1 2017-01-30 08:00:00     NA
 2 2017-01-30 08:00:01     NA
 3 2017-01-30 08:00:02     1
 4 2017-01-30 08:00:03     NA
 5 2017-01-30 08:00:04     NA
 6 2017-01-30 08:00:05     NA
 7 2017-01-30 08:00:06     NA
 8 2017-01-30 08:00:07     1
 9 2017-01-30 08:00:08     NA
10 2017-01-30 08:00:09     NA
11 2017-01-30 08:00:10     2
12 2017-01-30 08:00:11     NA
13 2017-01-30 08:00:12     NA
14 2017-01-30 08:00:13     NA
15 2017-01-30 08:00:14     2
16 2017-01-30 08:00:15     NA
17 2017-01-30 08:00:16     3
18 2017-01-30 08:00:17     NA
19 2017-01-30 08:00:18     3
20 2017-01-30 08:00:19     NA

Desired dataset:

            date_time     id
                <dttm>  <dbl>
 1 2017-01-30 08:00:00     NA
 2 2017-01-30 08:00:01     NA
 3 2017-01-30 08:00:02     1
 4 2017-01-30 08:00:03     1
 5 2017-01-30 08:00:04     1
 6 2017-01-30 08:00:05     1
 7 2017-01-30 08:00:06     1
 8 2017-01-30 08:00:07     1
 9 2017-01-30 08:00:08     NA
10 2017-01-30 08:00:09     NA
11 2017-01-30 08:00:10     2
12 2017-01-30 08:00:11     2
13 2017-01-30 08:00:12     2
14 2017-01-30 08:00:13     2
15 2017-01-30 08:00:14     2
16 2017-01-30 08:00:15     NA
17 2017-01-30 08:00:16     3
18 2017-01-30 08:00:17     3
19 2017-01-30 08:00:18     3
20 2017-01-30 08:00:19     NA

dput() date:

structure(list(date_time = structure(c(1485781200, 1485781201, 
1485781202, 1485781203, 1485781204, 1485781205, 1485781206, 1485781207, 
1485781208, 1485781209, 1485781210, 1485781211, 1485781212, 1485781213, 
1485781214, 1485781215, 1485781216, 1485781217, 1485781218, 1485781219
), class = c("POSIXct", "POSIXt"), tzone = ""), trx_id = c(NA_real_, 
NA_real_, 1, NA_real_, NA_real_, NA_real_, NA_real_, 1, 
NA_real_, NA_real_, 2, NA_real_, NA_real_, NA_real_, 2, 
NA_real_, 3, NA_real_, 3, NA_real_)), .Names = c("date_time", 
"trx_id"), row.names = c(NA, -20L), class = c("tbl_df", "tbl", 
"data.frame"))
like image 706
jstauss Avatar asked Dec 05 '22 12:12

jstauss


1 Answers

One solution could be using fill function from tidyr. The approach is simple. First create 2 columns one each for prev and next values. Use fill to populate missing values in both columns.

Now, for rows which has same value in both prev_val and next_val the value should be updated with prev_val (that means those missing values are between same number)

df <-  read.table(text = "sl date_time, value
1 '2017-01-30 08:00:00'     NA
2 '2017-01-30 08:00:01'     NA
3 '2017-01-30 08:00:02'     1
4 '2017-01-30 08:00:03'     NA
5 '2017-01-30 08:00:04'     NA
6 '2017-01-30 08:00:05'     NA
7 '2017-01-30 08:00:06'     NA
8 '2017-01-30 08:00:07'     1
9 '2017-01-30 08:00:08'     NA
10 '2017-01-30 08:00:09'     NA
11 '2017-01-30 08:00:10'     2
12 '2017-01-30 08:00:11'     NA
13 '2017-01-30 08:00:12'     NA
14 '2017-01-30 08:00:13'     NA
15 '2017-01-30 08:00:14'     2
16 '2017-01-30 08:00:15'     NA
17 '2017-01-30 08:00:16'     3
18 '2017-01-30 08:00:17'     NA
19 '2017-01-30 08:00:18'     3
20 '2017-01-30 08:00:19'     NA", header = T, stringsAsFactor = F)

#use fill to find missing values
df %>%
  mutate(prev_val = (value), next_val = (value)) %>%
  fill(prev_val, .direction = "down") %>%
  fill(next_val, .direction = "up") %>%
  mutate(value = ifelse(prev_val == next_val, prev_val, value )) %>%
  select(-prev_val, -next_val)

Result:
   sl          date_time. value
1   1 2017-01-30 08:00:00    NA
2   2 2017-01-30 08:00:01    NA
3   3 2017-01-30 08:00:02     1
4   4 2017-01-30 08:00:03     1
5   5 2017-01-30 08:00:04     1
6   6 2017-01-30 08:00:05     1
7   7 2017-01-30 08:00:06     1
8   8 2017-01-30 08:00:07     1
9   9 2017-01-30 08:00:08    NA
10 10 2017-01-30 08:00:09    NA
11 11 2017-01-30 08:00:10     2
12 12 2017-01-30 08:00:11     2
13 13 2017-01-30 08:00:12     2
14 14 2017-01-30 08:00:13     2
15 15 2017-01-30 08:00:14     2
16 16 2017-01-30 08:00:15    NA
17 17 2017-01-30 08:00:16     3
18 18 2017-01-30 08:00:17     3
19 19 2017-01-30 08:00:18     3
20 20 2017-01-30 08:00:19    NA
like image 155
MKR Avatar answered Jun 10 '23 22:06

MKR