I am dealing with a big time series: one column identifies which of four sensors recorded, and another column holds the measured values. I need to assign an id to measurements that belong to the same point in time. The problem is that the timing of the measurements differs slightly between devices, so I cannot simply group by timestamp. In a data frame ordered by time, measurements that belong together can instead be identified as runs of unique device ids. A further complication is that sometimes 4 devices record a value and sometimes only 3 do. My data looks like this:
timestamp device measurement
1 2019-08-27 07:29:20.671313 sdr_03 49.868820
2 2019-08-27 07:29:20.932043 sdr_02 54.160831
3 2019-08-27 07:29:21.839312 sdr_03 48.974476
4 2019-08-27 07:29:21.850454 sdr_02 50.808674
5 2019-08-27 08:57:01.990833 sdr_03 50.533058
6 2019-08-27 08:57:02.022798 sdr_04 51.143322
7 2019-08-27 09:16:56.454308 sdr_02 57.447151
8 2019-08-27 09:16:56.482433 sdr_04 50.012745
9 2019-08-27 09:16:56.761776 sdr_01 71.500305
10 2019-08-27 09:16:57.305510 sdr_02 56.851177
11 2019-08-27 09:16:57.333628 sdr_04 60.390141
12 2019-08-27 09:16:57.612972 sdr_01 73.470345
which you can reproduce with this:
my_data <- data.frame(
  timestamp = c("2019-08-27 07:29:20.671313", "2019-08-27 07:29:20.932043",
                "2019-08-27 07:29:21.839312", "2019-08-27 07:29:21.850454",
                "2019-08-27 08:57:01.990833", "2019-08-27 08:57:02.022798",
                "2019-08-27 09:16:56.454308", "2019-08-27 09:16:56.482433",
                "2019-08-27 09:16:56.761776", "2019-08-27 09:16:57.305510",
                "2019-08-27 09:16:57.333628", "2019-08-27 09:16:57.612972"),
  device = c("sdr_03", "sdr_02", "sdr_03", "sdr_02", "sdr_03", "sdr_04",
             "sdr_02", "sdr_04", "sdr_01", "sdr_02", "sdr_04", "sdr_01"),
  measurement = c(49.868820, 54.160831, 48.974476, 50.808674, 50.533058,
                  51.143322, 57.447151, 50.012745, 71.500305, 56.851177,
                  60.390141, 73.470345)
)
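Note that timestamp is stored as a plain character column here; if you want actual date-times that keep the sub-second precision, one way is:

options(digits.secs = 6)  # needed so the fractional seconds are printed
my_data$timestamp <- as.POSIXct(my_data$timestamp, format = "%Y-%m-%d %H:%M:%OS")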
I need to assign the same match_id to consecutive rows for as long as no device from the current run appears a second time; as soon as a device repeats, a new match_id starts:
timestamp device measurement match_id
1 2019-08-27 07:29:20.671313 sdr_03 49.868820 1
2 2019-08-27 07:29:20.932043 sdr_02 54.160831 1
3 2019-08-27 07:29:21.839312 sdr_03 48.974476 2
4 2019-08-27 07:29:21.850454 sdr_02 50.808674 2
5 2019-08-27 08:57:01.990833 sdr_03 50.533058 3
6 2019-08-27 08:57:02.022798 sdr_04 51.143322 3
7 2019-08-27 09:16:56.454308 sdr_02 57.447151 3
8 2019-08-27 09:16:56.482433 sdr_04 50.012745 4
9 2019-08-27 09:16:56.761776 sdr_01 71.500305 4
10 2019-08-27 09:16:57.305510 sdr_02 56.851177 4
11 2019-08-27 09:16:57.333628 sdr_04 60.390141 5
12 2019-08-27 09:16:57.612972 sdr_01 73.470345 5
which is the same data frame as above with the desired match_id column appended:

my_data$match_id <- c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5)
I have been searching for answers for three days now. Any help is very much appreciated.
EDIT: Allan Cameron's dplyr solution results in match ids that reappear later in the data frame (see rows 1, 2, 6 and 9 below). Also, fewer than 4 devices may record at any one time, so solutions that expect the same number of recording devices for every measurement won't work.
# A tibble: 12 x 4
# Groups: device [4]
timestamp device measurement new_id
<dttm> <fct> <dbl> <int>
1 2019-08-27 07:29:20.671313 sdr_03 49.9 1
2 2019-08-27 07:29:20.932043 sdr_02 54.2 1
3 2019-08-27 07:29:21.839312 sdr_03 49.0 2
4 2019-08-27 07:29:21.850454 sdr_02 50.8 2
5 2019-08-27 08:57:01.990833 sdr_03 50.5 3
6 2019-08-27 08:57:02.022798 sdr_04 51.1 1
7 2019-08-27 09:16:56.454308 sdr_02 57.4 3
8 2019-08-27 09:16:56.482433 sdr_04 50.0 2
9 2019-08-27 09:16:56.761775 sdr_01 71.5 1
10 2019-08-27 09:16:57.305510 sdr_02 56.9 4
11 2019-08-27 09:16:57.333627 sdr_04 60.4 3
12 2019-08-27 09:16:57.612972 sdr_01 73.5 2
Sotos's solution, on the other hand, assigns the same match id to more consecutive rows than there are unique devices, e.g. rows 5-9:
# A tibble: 12 x 4
timestamp device measurement new_id
<chr> <fct> <dbl> <int>
1 2019-08-27 07:29:20 sdr_03 49.9 1
2 2019-08-27 07:29:20 sdr_02 54.2 1
3 2019-08-27 07:29:21 sdr_03 49.0 2
4 2019-08-27 07:29:21 sdr_02 50.8 2
5 2019-08-27 08:57:01 sdr_03 50.5 3
6 2019-08-27 08:57:02 sdr_04 51.1 3
7 2019-08-27 09:16:56 sdr_02 57.4 3
8 2019-08-27 09:16:56 sdr_04 50.0 3
9 2019-08-27 09:16:56 sdr_01 71.5 3
10 2019-08-27 09:16:57 sdr_02 56.9 4
11 2019-08-27 09:16:57 sdr_04 60.4 4
12 2019-08-27 09:16:57 sdr_01 73.5 4
Both solutions work great (thanks!) when the time differences between measurements are > 0.7 s or when all 4 devices recorded at the same time. Sadly, most of the time this is not the case. I think a solution that ignores the timestamps and instead checks for duplicates in consecutive rows would be more robust. I have found many solutions for runs of repeated values using rle() or data.table, but none that identifies runs of unique values. Please help me out here!
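For example, plain duplicated() flags every repeat over the whole column, but without the "reset" I would need each time a group closes:

duplicated(my_data$device)
#>  [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE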
I am pretty sure I have overthought it, but here is a working solution,
library(dplyr)

data %>%
  mutate(timestamp = format(timestamp, '%Y-%m-%d %H:%M:%S')) %>%  # truncate to whole seconds
  group_by(timestamp) %>%
  mutate(new = data.table::rleid(duplicated(device))) %>%         # run ids over the repeat flags within each second
  group_by(timestamp, new) %>%
  mutate(new1 = row_number() + new) %>%                           # running counter within each run
  ungroup() %>%
  mutate(new_id = cumsum(c(TRUE, diff(new1) < 0))) %>%            # new group whenever the counter drops
  select(-c(new, new1))
which gives,
# A tibble: 12 x 4
   timestamp           device measurement new_id
   <fct>               <fct>        <dbl>  <int>
 1 2019-08-27 09:48:54 sdr_02        80.2      1
 2 2019-08-27 09:48:54 sdr_01        71.7      1
 3 2019-08-27 09:48:54 sdr_04        74.2      1
 4 2019-08-27 09:48:54 sdr_03        62.6      1
 5 2019-08-27 09:48:55 sdr_02        77.1      2
 6 2019-08-27 09:48:55 sdr_01        69.2      2
 7 2019-08-27 09:48:55 sdr_03        62.1      2
 8 2019-08-27 09:48:55 sdr_02        77.1      3
 9 2019-08-27 09:48:55 sdr_01        54.6      3
10 2019-08-27 09:48:55 sdr_03        64.3      3
11 2019-08-27 09:48:56 sdr_02        66.5      4
12 2019-08-27 09:48:56 sdr_01        71.7      4
Couldn't this be done more simply?
library(dplyr)
df %>%
  group_by(device) %>%
  mutate(new_id = seq_len(length(device)),
         timestamp = as.POSIXct(timestamp))
#> # A tibble: 12 x 4
#> # Groups: device [4]
#> timestamp device measurement new_id
#> <dttm> <fct> <dbl> <int>
#> 1 2019-08-27 09:48:54 sdr_02 80.2 1
#> 2 2019-08-27 09:48:54 sdr_01 71.7 1
#> 3 2019-08-27 09:48:54 sdr_04 74.2 1
#> 4 2019-08-27 09:48:54 sdr_03 62.6 1
#> 5 2019-08-27 09:48:55 sdr_02 77.1 2
#> 6 2019-08-27 09:48:55 sdr_01 69.2 2
#> 7 2019-08-27 09:48:55 sdr_03 62.1 2
#> 8 2019-08-27 09:48:55 sdr_02 77.1 3
#> 9 2019-08-27 09:48:55 sdr_01 54.6 3
#> 10 2019-08-27 09:48:55 sdr_03 64.3 3
#> 11 2019-08-27 09:48:56 sdr_02 66.5 4
#> 12 2019-08-27 09:48:56 sdr_01 71.7 4
UPDATE
Based on the OP's comments, the simple per-device counter above breaks down when not every device records in every cycle. It seems the best way to do this is to define a function that keeps a running tally of the devices it has encountered and increments the ID whenever it reaches a duplicate.
# Code # Pseudocode
# ======================================= # ===================================
group_instances <- function(my_labels) #
{ #
my_labels <- as.character(my_labels) # (Ensure we use a character vector)
#
result <- numeric(length(my_labels)) # Create a numeric result vector
matches <- as.character(my_labels[1]) # Create tally of encountered devices
#
for(i in seq_along(my_labels)[-1]) # For each device record after the first
{ #
if(my_labels[i] %in% matches) # If we have this device in our tally
{ #
matches <- my_labels[i] # Reset our tally of devices
result[i] <- result[i - 1] + 1 # and increment our ID
} #
else # Otherwise
{ #
matches <- c(matches, my_labels[i]) # Add it to our tally of devices
result[i] <- result[i - 1] # and copy the ID from the row above
} #
} #
return(result + 1) # Our IDs started at zero, so add one
}
Now we can do
my_data %>% mutate(ID = as.factor(group_instances(device)))
#> timestamp device measurement ID
#> 1 2019-08-27 07:29:20.671313 sdr_03 49.86882 1
#> 2 2019-08-27 07:29:20.932043 sdr_02 54.16083 1
#> 3 2019-08-27 07:29:21.839312 sdr_03 48.97448 2
#> 4 2019-08-27 07:29:21.850454 sdr_02 50.80867 2
#> 5 2019-08-27 08:57:01.990833 sdr_03 50.53306 3
#> 6 2019-08-27 08:57:02.022798 sdr_04 51.14332 3
#> 7 2019-08-27 09:16:56.454308 sdr_02 57.44715 3
#> 8 2019-08-27 09:16:56.482433 sdr_04 50.01275 4
#> 9 2019-08-27 09:16:56.761776 sdr_01 71.50030 4
#> 10 2019-08-27 09:16:57.305510 sdr_02 56.85118 4
#> 11 2019-08-27 09:16:57.333628 sdr_04 60.39014 5
#> 12 2019-08-27 09:16:57.612972 sdr_01 73.47034 5
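If you prefer to avoid an explicit loop, the same running-tally logic can also be expressed with purrr::accumulate(); a sketch, assuming my_data is already ordered by timestamp:

library(dplyr)
library(purrr)

# Carry (current id, devices seen in the current group) down the device column
states <- accumulate(
  as.character(my_data$device)[-1],
  function(state, dev) {
    if (dev %in% state$seen) {
      list(id = state$id + 1, seen = dev)             # repeat: start a new group
    } else {
      list(id = state$id, seen = c(state$seen, dev))  # new device: extend the group
    }
  },
  .init = list(id = 1, seen = as.character(my_data$device)[1])
)

my_data %>% mutate(match_id = map_dbl(states, "id"))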