Identify duplicate data with a threshold

Tags: r

I am working with Bluetooth sensor data and need to identify possible duplicate readings for each unique ID. The sensor scans every five seconds and may pick up the same device in consecutive scans if the device isn't moving quickly (e.g. sitting in traffic). A device that made a round trip may also appear more than once, but those readings should be separated by several minutes. I can't wrap my head around how to get rid of the duplicate data. I think I need a column holding the time difference between consecutive readings whenever the macids match.

The data has the format:

          macid   time
00:03:7A:4D:F3:59  82333
00:03:7A:EF:58:6F 223556
00:03:7A:EF:58:6F 223601
00:03:7A:EF:58:6F 232731
00:03:7A:EF:58:6F 232736
00:05:4F:0B:45:F7 164141

And I need to create:

            macid   time timediff
00:03:7A:4D:F3:59  82333 NA
00:03:7A:EF:58:6F 223556 NA
00:03:7A:EF:58:6F 223601 45
00:03:7A:EF:58:6F 232731 9130
00:03:7A:EF:58:6F 232736 5
00:05:4F:0B:45:F7 164141 NA

My first attempt at this is extremely slow and not really usable:

dedupeIDs <- function (zz) {
  # Order by macid and then time
  zz <- zz[order(zz$macid, zz$time), ]

  # Difference from the previous row; 999999 is a sentinel for "no previous reading"
  zz$timediff <- c(999999, diff(zz$time))

  for (i in seq_len(nrow(zz))[-1]) {
    # Reset the difference whenever the macid changes between rows
    if (zz[i, "macid"] != zz[i - 1, "macid"]) {
      zz[i, "timediff"] <- 999999
    }
  }
  return(zz)
}

I'll then be able to filter the data.frame based on the time difference column.

Sample data:

structure(list(macid = structure(c(1L, 2L, 2L, 2L, 2L, 3L),
          .Label = c("00:03:7A:4D:F3:59", "00:03:7A:EF:58:6F", 
                     "00:05:4F:0B:45:F7"), class = "factor"), 
          time = c(82333, 223556, 223601, 232731, 232736, 164141)), 
          .Names = c("macid", "time"), row.names = c(NA, -6L), 
          class = "data.frame")
Chase asked Apr 01 '11 19:04


1 Answer

How about:

x <- structure(list(macid= structure(c(1L, 2L, 2L, 2L, 2L, 3L),
 .Label = c("00:03:7A:4D:F3:59", "00:03:7A:EF:58:6F", "00:05:4F:0B:45:F7"),
 class = "factor"), time = c(82333, 223556, 223601, 232731, 232736, 164141)),
.Names = c("macid", "time"), row.names = c(NA, -6L), class = "data.frame")
# ensure 'x' is ordered properly
x <- x[order(x$macid,x$time),]
# add timediff column by macid
x$timediff <- ave(x$time, x$macid, FUN=function(x) c(NA,diff(x)))
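
Once the per-macid difference exists, the filtering step mentioned in the question could look like the sketch below. Two things here are assumptions, not part of the question or the answer above: the 60-second `threshold`, and the guess that `time` is an HHMMSS clock value (which is why 223556 to 223601 differs by 45 numerically but by only 5 seconds). The helper `hhmmss_to_seconds` is a hypothetical name introduced for illustration; if `time` is already in seconds, skip the conversion and use `time` directly.

```r
# Sample data from the question (macid kept as character for simplicity)
x <- data.frame(
  macid = c("00:03:7A:4D:F3:59", "00:03:7A:EF:58:6F", "00:03:7A:EF:58:6F",
            "00:03:7A:EF:58:6F", "00:03:7A:EF:58:6F", "00:05:4F:0B:45:F7"),
  time  = c(82333, 223556, 223601, 232731, 232736, 164141)
)

# Hypothetical helper: convert an HHMMSS number to seconds since midnight.
# ASSUMPTION: 'time' encodes a clock time as HHMMSS.
hhmmss_to_seconds <- function(t) {
  (t %/% 10000) * 3600 + ((t %/% 100) %% 100) * 60 + (t %% 100)
}

# Order by macid, then time, so diff() compares consecutive readings
x <- x[order(x$macid, x$time), ]
x$secs <- hhmmss_to_seconds(x$time)

# Seconds since the previous reading of the same macid (NA for the first one)
x$timediff <- ave(x$secs, x$macid, FUN = function(s) c(NA, diff(s)))

# ASSUMPTION: readings closer together than this are treated as duplicates
threshold <- 60

# Keep the first reading per macid and any reading after a long enough gap
deduped <- x[is.na(x$timediff) | x$timediff > threshold, ]
```

This keeps the first reading of each burst and drops the follow-up scans, so a device sitting in traffic for several minutes still collapses to a single row.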
Joshua Ulrich answered Sep 20 '22 02:09