I am working with time series data that needs continuous time stamps, but a few time stamps were missed during capture, as shown below:
DF
ID Time_Stamp A B C
1 02/02/2018 07:45:00 123 567 434
2 02/02/2018 07:45:01
..... ...
5 02/02/2018 07:46:00
6 02/02/2018 07:46:10 112 2323 2323
As shown in the sample df above, the time stamps are continuous up to row 5, but the data for the 10 seconds between rows 5 and 6 were not captured. My data frame has about 60000 rows, so identifying the missing values manually is tedious.
Hence I am looking for a way to automate the handling of missing values using R.
My expected result data frame is as below:
ID Time_Stamp A B C
1 02/02/2018 07:45:00 123 567 434
2 02/02/2018 07:45:01
..... ...
5 02/02/2018 07:46:00 mean(A)
5.1 02/02/2018 07:46:01 mean(A) mean(b) mean(c)
5.2 02/02/2018 07:46:02 mean(A) mean(b) mean(c)
5.3 02/02/2018 07:46:03 mean(A) mean(b) mean(c)
5.4 02/02/2018 07:46:04 mean(A) mean(b) mean(c)
5.5 02/02/2018 07:46:05 mean(A) mean(b) mean(c)
5.6 02/02/2018 07:46:06 mean(A) mean(b) mean(c)
5.7 02/02/2018 07:46:07 mean(A) mean(b) mean(c)
5.8 02/02/2018 07:46:08 mean(A) mean(b) mean(c)
5.9 02/02/2018 07:46:09 mean(A) mean(b) mean(c)
6 02/02/2018 07:46:10 112 2323 2323
Kindly Help!
It is always better to provide a specific example with the exact expected output, so that there is little room for ambiguity and assumption. However, I have created dummy data based on my understanding and tried to solve it accordingly.
If I have understood you correctly, you have time series data with a data point every second, but some seconds are missing, and you want to fill those rows with the mean of each column.
We can achieve this with complete() from tidyr by generating a sequence for every second between the min and max Time_Stamp and filling the missing values with the mean of the respective column. ID looks like a unique identifier for each row, so I filled it with row_number().
library(dplyr)
library(tidyr)
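df %>%
  # add a row for every second between the first and last time stamp
  complete(Time_Stamp = seq(min(Time_Stamp), max(Time_Stamp), by = "sec")) %>%
  # replace the NAs in A, B and C with the mean of the observed values
  mutate(across(A:C, ~ replace(.x, is.na(.x), mean(.x, na.rm = TRUE)))) %>%
  # renumber the rows so that ID stays unique
  mutate(ID = row_number())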
# A tibble: 11 x 5
# Time_Stamp ID A B C
# <dttm> <int> <dbl> <dbl> <dbl>
# 1 2018-02-02 07:45:00 1 123 567 434
# 2 2018-02-02 07:45:01 2 234 100 110
# 3 2018-02-02 07:45:02 3 234 100 110
# 4 2018-02-02 07:45:03 4 176. 772. 744.
# 5 2018-02-02 07:45:04 5 176. 772. 744.
# 6 2018-02-02 07:45:05 6 176. 772. 744.
# 7 2018-02-02 07:45:06 7 176. 772. 744.
# 8 2018-02-02 07:45:07 8 176. 772. 744.
# 9 2018-02-02 07:45:08 9 176. 772. 744.
#10 2018-02-02 07:45:09 10 176. 772. 744.
#11 2018-02-02 07:45:10 11 112 2323 2323
If you check the column means of the last 3 columns of the original df, you can see that the missing values were replaced accurately.
colMeans(df[3:5])
# A B C
#175.75 772.50 744.25
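Since the question mentions that finding the gaps by hand is tedious, it may also help to list which time stamps are actually missing before filling them. The following is only a sketch in base R, assuming one reading per second and a POSIXct Time_Stamp column; the names full_seq, missing_ts and gap_after are made up for this illustration, and the data frame is rebuilt inline so the snippet runs on its own.

```r
# Same dummy data as in the "data" section of this answer
df <- data.frame(
  ID = 1:4,
  Time_Stamp = as.POSIXct(c("2018-02-02 07:45:00", "2018-02-02 07:45:01",
                            "2018-02-02 07:45:02", "2018-02-02 07:45:10"),
                          tz = "UTC"),
  A = c(123L, 234L, 234L, 112L),
  B = c(567L, 100L, 100L, 2323L),
  C = c(434L, 110L, 110L, 2323L)
)

# Every second between the first and the last observation
full_seq <- seq(min(df$Time_Stamp), max(df$Time_Stamp), by = "sec")

# Time stamps that should exist but were never recorded
# (compared as numeric seconds to avoid any class-related surprises)
missing_ts <- full_seq[!as.numeric(full_seq) %in% as.numeric(df$Time_Stamp)]

# Rows of df that are followed by a gap of more than one second
gap_after <- which(diff(as.numeric(df$Time_Stamp)) > 1)

length(missing_ts)  # number of missing seconds
gap_after           # row index after which a gap starts
```

On the dummy data this reports 7 missing seconds, with the gap starting after row 3, which matches the 7 filled rows in the output above.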
data
df <- structure(list(ID = 1:4, Time_Stamp = structure(c(1517557500,
1517557501, 1517557502, 1517557510), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), A = c(123L, 234L, 234L, 112L), B = c(567L,
100L, 100L, 2323L), C = c(434L, 110L, 110L, 2323L)), class = "data.frame",
row.names = c(NA, -4L))
which looks like
df
# ID Time_Stamp A B C
#1 1 2018-02-02 07:45:00 123 567 434
#2 2 2018-02-02 07:45:01 234 100 110
#3 3 2018-02-02 07:45:02 234 100 110
#4 4 2018-02-02 07:45:10 112 2323 2323
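Note that the dummy data already stores Time_Stamp as POSIXct, which seq(..., by = "sec") relies on. If the column is still text in the form shown in the question, it probably needs converting first. This is a minimal sketch, assuming the format is day/month/year and the time zone is UTC (the question does not say which of the two leading fields is the day); raw_ts and ts are names invented for this example.

```r
# Two time stamps copied from the question, as character strings
raw_ts <- c("02/02/2018 07:45:00", "02/02/2018 07:46:10")

# Assumption: day/month/year order and UTC; adjust format and tz to your data
ts <- as.POSIXct(raw_ts, format = "%d/%m/%Y %H:%M:%S", tz = "UTC")

class(ts)                        # should include "POSIXct"
difftime(ts[2], ts[1], units = "secs")
```

With a real data frame, the same call could be applied to the column, for example df$Time_Stamp <- as.POSIXct(df$Time_Stamp, format = "%d/%m/%Y %H:%M:%S", tz = "UTC"), before running the complete() pipeline.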