
Fill missing sequence values with dplyr

Tags:

r

dplyr

I have a data frame with missing values in the "SNAP_ID" column. I'd like to fill each missing value with a floating-point value that continues in steps of 0.1 from the previous non-missing value (lag()?). I would really like to achieve this using just dplyr if possible.

Assumptions:

  1. There will never be missing data in the first or last row, because I'm generating the missing dates from the missing days between the min and max of a data set
  2. There can be multiple gaps in the data set

Current data:

                  end SNAP_ID
1 2015-06-26 12:59:00     365
2 2015-06-26 13:59:00     366
3 2015-06-27 00:01:00      NA
4 2015-06-27 23:00:00      NA
5 2015-06-28 00:01:00      NA
6 2015-06-28 23:00:00      NA
7 2015-06-29 09:00:00     367
8 2015-06-29 09:59:00     368

What I want to achieve:

                  end SNAP_ID
1 2015-06-26 12:59:00     365.0
2 2015-06-26 13:59:00     366.0
3 2015-06-27 00:01:00     366.1
4 2015-06-27 23:00:00     366.2
5 2015-06-28 00:01:00     366.3
6 2015-06-28 23:00:00     366.4
7 2015-06-29 09:00:00     367.0
8 2015-06-29 09:59:00     368.0

As a data frame:

df <- structure(list(end = structure(c(1435323540, 1435327140, 1435363260, 
    1435446000, 1435449660, 1435532400, 1435568400, 1435571940), tzone = "UTC", class = c("POSIXct", 
    "POSIXt")), SNAP_ID = c(365, 366, NA, NA, NA, NA, 367, 368)), .Names = c("end", 
    "SNAP_ID"), row.names = c(NA, -8L), class = "data.frame")

This was my attempt at achieving this goal, but it only works for the first missing value:

df %>% 
  arrange(end) %>%
  mutate(SNAP_ID=ifelse(is.na(SNAP_ID),lag(SNAP_ID)+0.1,SNAP_ID))

                  end SNAP_ID
1 2015-06-26 12:59:00   365.0
2 2015-06-26 13:59:00   366.0
3 2015-06-27 00:01:00   366.1
4 2015-06-27 23:00:00      NA
5 2015-06-28 00:01:00      NA
6 2015-06-28 23:00:00      NA
7 2015-06-29 09:00:00   367.0
8 2015-06-29 09:59:00   368.0
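
A minimal sketch of why the attempt above stops after the first NA, using a short made-up vector `x` rather than the real data: `lag()` shifts the original column, so inside one `mutate()` call it never sees a value that was just filled in.

```r
library(dplyr)

# Hypothetical toy vector (not the question's data): one run of two NAs
x <- c(365, 366, NA, NA, 367)

# dplyr::lag() shifts the ORIGINAL vector by one position
shifted <- dplyr::lag(x)   # NA 365 366 NA NA

# Only the first NA has a non-missing predecessor; the second NA
# looks back at another NA, and NA + 0.1 is still NA
filled <- ifelse(is.na(x), shifted + 0.1, x)
print(filled)
```

The second NA stays NA, which matches the output shown in the attempt.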

The outstanding answer from @mathematical.coffee below:

df %>% 
  arrange(end) %>%
  group_by(tmp=cumsum(!is.na(SNAP_ID))) %>%
  mutate(SNAP_ID=SNAP_ID[1] + 0.1*(0:(length(SNAP_ID)-1))) %>%
  ungroup() %>%
  select(-tmp)

Tyler Muth asked Jul 16 '15 22:07



1 Answer

EDIT: new version works for any number of NA runs. This one doesn't need zoo, either.

First, notice that tmp=cumsum(!is.na(SNAP_ID)) groups the SNAP_IDs so that each group (rows sharing the same tmp) consists of one non-NA value followed by a run of zero or more NA values.
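
To see the grouping on its own, here is a small base R sketch using the SNAP_ID values from the question (the variable name `snap_id` is just for illustration):

```r
# SNAP_ID values from the question's data frame
snap_id <- c(365, 366, NA, NA, NA, NA, 367, 368)

# The counter increases by one at every non-NA value and stays flat over NAs
tmp <- cumsum(!is.na(snap_id))
print(tmp)

# Each group starts with a non-NA value, followed by its trailing NAs (if any)
groups <- split(snap_id, tmp)
print(groups)
```

Every group begins with a known ID, which is what makes `SNAP_ID[1]` a safe base value within each group.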

Then group by this variable and add 0.1 for each position after the first SNAP_ID in the group to fill in the NAs:

df %>% 
  arrange(end) %>%
  group_by(tmp=cumsum(!is.na(SNAP_ID))) %>%
  mutate(SNAP_ID=SNAP_ID[1] + 0.1*(0:(length(SNAP_ID)-1)))

                  end SNAP_ID tmp
1 2015-06-26 12:59:00   365.0   1
2 2015-06-26 13:59:00   366.0   2
3 2015-06-27 00:01:00   366.1   2
4 2015-06-27 23:00:00   366.2   2
5 2015-06-28 00:01:00   366.3   2
6 2015-06-28 23:00:00   366.4   2
7 2015-06-29 09:00:00   367.0   3
8 2015-06-29 09:59:00   368.0   4

Then you can drop the tmp column afterwards (add %>% select(-tmp) to the end).
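
As a rough sketch, the same idea can be wrapped in a helper. The function name `fill_snap_id` and the `step` argument are hypothetical, and `first()` / `row_number()` are used here as an equivalent spelling of `SNAP_ID[1]` and `0:(length(SNAP_ID)-1)`. The sample data is made up and contains two separate NA runs.

```r
library(dplyr)

# Hypothetical helper: fills each NA run with base value + step, 2*step, ...
fill_snap_id <- function(data, step = 0.1) {
  data %>%
    arrange(end) %>%
    # One group per non-NA value plus the NAs that follow it
    group_by(tmp = cumsum(!is.na(SNAP_ID))) %>%
    mutate(SNAP_ID = first(SNAP_ID) + step * (row_number() - 1)) %>%
    ungroup() %>%
    select(-tmp)
}

# Made-up sample data with two gaps (first and last rows are non-missing)
df2 <- data.frame(
  end = as.POSIXct("2015-06-26 12:00:00", tz = "UTC") + 3600 * 0:5,
  SNAP_ID = c(365, NA, NA, 366, NA, 367)
)

result <- fill_snap_id(df2)
print(result$SNAP_ID)
```

Note that with a step of 0.1, a run of ten or more NAs would reach the next whole number, so a smaller step may be needed if long gaps are possible.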


EDIT: this is the old version which doesn't work for subsequent runs of NAs.

If your aim is to fill each NA with the previous value + 0.1, you can use zoo's na.locf (which fills each NA with the previous value), along with cumsum(is.na(SNAP_ID))*0.1 to add the extra 0.1.

library(zoo)
df %>% 
  arrange(end) %>%
  mutate(SNAP_ID=ifelse(is.na(SNAP_ID),
                       na.locf(SNAP_ID) + cumsum(is.na(SNAP_ID))*0.1,
                       SNAP_ID))
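
A small sketch of why this older version breaks on a second run of NAs: `cumsum(is.na(SNAP_ID))` keeps counting across the whole column and never resets. The vector below is made up, and the `carried` line is a base R stand-in for `na.locf()` (an assumption that holds only when the first value is not NA), so the example runs without zoo.

```r
# Made-up vector with two separate NA runs
snap_id <- c(365, NA, NA, 366, NA, 367)

# Offset grows over the whole vector instead of restarting per run
offset <- cumsum(is.na(snap_id)) * 0.1

# Base R stand-in for zoo::na.locf(): carry the last non-NA value forward
carried <- snap_id[!is.na(snap_id)][cumsum(!is.na(snap_id))]

old_version <- ifelse(is.na(snap_id), carried + offset, snap_id)
print(old_version)
```

The NA after 366 comes out as 366.3 rather than 366.1, because the two NAs from the first run are still included in the running count.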

mathematical.coffee answered Oct 11 '22 18:10