Let's say I have a dataframe:
df <- data.frame(group = c('A','A','A','B','B','B'),
time = c(1,2,4,1,2,3),
data = c(5,6,7,8,9,10))
What I want to do is insert rows into the data frame wherever a value is missing from the time sequence. In the example above, I'm missing data for time = 3 in group A and time = 4 in group B. I would essentially want to put 0's in the data column for those rows.
How would I go about adding these additional rows?
The goal would be:
df <- data.frame(group = c('A','A','A','A','B','B','B','B'),
time = c(1,2,3,4,1,2,3,4),
data = c(5,6,0,7,8,9,10,0))
My real data has a couple thousand data points, so doing this manually isn't feasible.
You can try merge/expand.grid
res <- merge(
expand.grid(group=unique(df$group), time=unique(df$time)),
df, all=TRUE)
res$data[is.na(res$data)] <- 0
res
# group time data
#1 A 1 5
#2 A 2 6
#3 A 3 0
#4 A 4 7
#5 B 1 8
#6 B 2 9
#7 B 3 10
#8 B 4 0
Or using data.table
library(data.table)
setkey(setDT(df), group, time)[CJ(group=unique(group), time=unique(time))
][is.na(data), data:=0L]
# group time data
#1: A 1 5
#2: A 2 6
#3: A 3 0
#4: A 4 7
#5: B 1 8
#6: B 2 9
#7: B 3 10
#8: B 4 0
As @thelatemail mentioned in the comments, the above method would fail if a particular 'time' value is absent from every group, because unique(df$time) would then never contain it. Building the grid from the full range of times is more general:
res <- merge(
expand.grid(group=unique(df$group),
time=min(df$time):max(df$time)),
df, all=TRUE)
res$data[is.na(res$data)] <- 0
Similarly, replace time=unique(time) with time=min(time):max(time) in the data.table solution.
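To see why the range-based grid matters, here is a minimal base R sketch. The data frame df2 is hypothetical (not from the question): it is a variant of df in which time = 3 is missing from both groups. It assumes the time values are whole numbers, as in the question.

```r
# Hypothetical data: time = 3 is missing from BOTH groups
df2 <- data.frame(group = c('A','A','A','B','B'),
                  time  = c(1, 2, 4, 1, 2),
                  data  = c(5, 6, 7, 8, 9))

# Grid built from the observed times only: time = 3 never appears in it
grid_unique <- expand.grid(group = unique(df2$group),
                           time  = unique(df2$time))

# Grid built from the full range: every integer from min to max
grid_range <- expand.grid(group = unique(df2$group),
                          time  = min(df2$time):max(df2$time))

# Outer-join each grid with the data and fill the gaps with 0
res_unique <- merge(grid_unique, df2, all = TRUE)
res_range  <- merge(grid_range,  df2, all = TRUE)
res_unique$data[is.na(res_unique$data)] <- 0
res_range$data[is.na(res_range$data)]   <- 0

nrow(res_unique)         # 6 rows: times 1, 2, 4 for each group
nrow(res_range)          # 8 rows: times 1 to 4 for each group
3 %in% res_unique$time   # FALSE
3 %in% res_range$time    # TRUE
```

With unique(), the rows for time = 3 are silently left out; with min:max they are created and filled with 0.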