I am new to R, In my data Frame I have col1("Timestamp"), col2("Values"). I have to remove rows of more than 2 consecutive NA in col2. My dataframe Looks like the below one,
Timestamp | values
-----------|--------
2011-01-02 | 2
2011-01-03 | 3
2011-01-04 | NA
2011-01-05 | 1
2011-01-06 | NA
2011-01-07 | NA
2011-01-08 | 8
2011-01-09 | 6
2011-01-10 | NA
2011-01-11 | NA
2011-01-12 | NA
2011-01-13 | 2
I would like to remove more than 2 duplicate rows based on second column. Expected output -
Timestamp | values
-----------|--------
2011-01-02 | 2
2011-01-03 | 3
2011-01-04 | NA
2011-01-05 | 1
2011-01-06 | NA
2011-01-07 | NA
2011-01-08 | 8
2011-01-09 | 6
2011-01-13 | 2
I'm looking for the solution thanks in advance.
You can use the run length encoding function rle
. I assume that the data is already sorted by date.
r <- rle(is.na(df$values)) # check runs of NA in value column
df[!rep(r$values & r$lengths > 2, r$lengths),] # remove runs of >2 length
Here is another option using rleid
from data.table
. Convert the 'data.frame' to 'data.table' (setDT(df1)
), grouped by the run-length-id of 'values', we get the row index (.I
) where the number of rows is greater than 2 (.N >2
) and (&
) all
the 'values' are 'NA'. Extract the index ($V1
) to subset the rows of original dataset.
library(data.table)
setDT(df1)[df1[, .I[!(.N >2 & all(is.na(values)))], rleid(is.na(values))]$V1]
# Timestamp values
#1: 2011-01-02 2
#2: 2011-01-03 3
#3: 2011-01-04 NA
#4: 2011-01-05 1
#5: 2011-01-06 NA
#6: 2011-01-07 NA
#7: 2011-01-08 8
#8: 2011-01-09 6
#9: 2011-01-13 2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With