
Conditional filtering / subsetting data in linear distance data in R

Here is my small example:

Mark <- paste ("SN", 1:400, sep = "") 
highway <- rep(1:4, each = 100)
set.seed (1234)
MAF <- rnorm (400, 0.3, 0.1)
PPC <- abs (ceiling( rnorm (400, 5, 5)))

set.seed (1234)
Position  <- round(c(cumsum (rnorm (100, 5, 3)), 
cumsum (rnorm (100, 10, 3)), cumsum (rnorm (100, 8, 3)),
  cumsum (rnorm (100, 6, 3))), 1)

mydf <- data.frame (Mark, highway, Position, MAF, PPC)

I want to keep the rows where PPC is less than 10 and, at the same time, MAF is greater than 0.3.

# filter PPC < 10 & MAF > 0.3
filtered <- mydf[mydf$PPC < 10 & mydf$MAF > 0.3, ]

I have a grouping variable, highway, and each Mark has a Position on its highway. For example, here are the first five marks on highway 1:

      1.4     7.2      15.5 13.4 19.7
 |-----|.......|.......|.....|.....|
      "SN1" "SN2"   "SN3"  "SN4" "SN5"

Now I want to pick about 30 Marks in total such that they are well distributed along each highway, based on their Position (taking into account that the highways differ in length), and such that the distance between any two picks is at least 10.

Edit: The idea (rough sketch): [image not available]

I have only a rough idea of how to solve this. Help appreciated.

Edits: Here is what I could figure out:

# The maximum (length) of each highway is: 
out <-  tapply(mydf$Position, mydf$highway, max)
out 
     1      2      3      4 
 453.0 1012.4  846.4  597.6 

min(out)
[1] 453

# Total length of all highways
totallength <- sum(out)

# Thus the average spacing at which marks need to be placed:
totallength / 30 
[1] 96.98 

For highway 1, the theoretical marks could be at:

 96.98, 96.98 + 96.98, 96.98 + 96.98 + 96.98, ... for as long as the value is less
    than the maximum (length) of highway 1.

Thus, theoretically, we need to choose a mark every 96.98. But the existing marks on a highway may not be found at exactly those positions.

Note: the total number of marks selected need not be exactly 30 (around 30 is fine).
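The spacing idea above can be sketched as follows. This is only an illustration: the highway lengths are copied from the `tapply()` output shown earlier, and the names `highway_max`, `spacing` and `targets` are made up for this example.

```r
# Highway lengths copied from the tapply() output above
highway_max <- c("1" = 453.0, "2" = 1012.4, "3" = 846.4, "4" = 597.6)

# Average spacing if about 30 marks are spread over the total length
spacing <- sum(highway_max) / 30

# Theoretical positions: multiples of the spacing that fit on each highway
targets <- lapply(highway_max, function(len) {
  if (len < spacing) numeric(0) else seq(spacing, len, by = spacing)
})

lengths(targets)       # number of theoretical positions per highway
sum(lengths(targets))  # total, which need not be exactly 30
```

With these lengths the highways get 4, 10, 8 and 6 theoretical positions (28 in total), which is consistent with the "around 30" requirement.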

shNIL asked Jul 06 '12 13:07


1 Answer

Since we aren't bothered about any other columns, the code is a little easier if we use split to get a list of positions.

filtered$highway <- factor(filtered$highway)
positions <- with(filtered, split(Position, highway))

A suitable number of marks in each highway can be found using the relative length of each highway.

highway_lengths <- sapply(positions, max)
total_length <- sum(highway_lengths)
n_marks_per_highway <- round(30 * highway_lengths / total_length)
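As a worked example of this allocation, here is the same arithmetic with the highway lengths taken from the question's `tapply()` output. That is an assumption made for illustration: those lengths come from the unfiltered data, so the values computed from `filtered` may differ slightly.

```r
# Highway lengths copied from the question (unfiltered data, illustration only)
highway_lengths <- c("1" = 453.0, "2" = 1012.4, "3" = 846.4, "4" = 597.6)
total_length <- sum(highway_lengths)

# Share of 30 marks proportional to each highway's length, rounded
n_marks_per_highway <- round(30 * highway_lengths / total_length)

n_marks_per_highway       # 5 10 9 6
sum(n_marks_per_highway)  # 30
```

Here the rounded counts happen to sum to exactly 30. In general, rounding each share separately can give a total that is off by one or two, which is fine given that only "around 30" marks are required.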

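We can use the quantile function to get target points that are spread evenly through the marks on each highway. (Quantiles are evenly spaced by rank, which is close to evenly spaced by distance when the marks are roughly uniformly distributed along the highway.)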

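target_mark_points <- mapply(
  function(pos, n)
  {
    # n probabilities from 0 to 1; length.out also copes with n = 1
    quantile(pos, seq(0, 1, length.out = n))
  },
  positions,
  n_marks_per_highway,
  SIMPLIFY = FALSE  # always return a list, one element per highway
)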
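To see what quantile-based targets do when marks are unevenly spread, here is a small made-up example (the positions and the names `by_rank` and `by_distance` are hypothetical). It compares the quantile approach with targets spaced evenly by distance using `seq()`, which is a possible alternative and not part of the approach above.

```r
# Made-up positions: four marks bunched together and one far away
pos <- c(0, 1, 2, 3, 100)
n <- 3

# Quantile targets are evenly spaced by rank among the marks
by_rank <- unname(quantile(pos, seq(0, 1, length.out = n)))

# Alternative: targets evenly spaced by distance along the highway
by_distance <- seq(min(pos), max(pos), length.out = n)

by_rank      # 0 2 100
by_distance  # 0 50 100
```

For data like the question's, where positions are cumulative sums of similar steps, the two sets of targets should be close, so the choice matters mainly when marks cluster.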

For each target point, we find the nearest existing mark in the highway.

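actual_mark_points <- mapply(
  function(pos, target)  
  {
    sapply(target, function(tgt) 
    {
      # distance from this target to every existing mark
      d <- abs(tgt - pos)
      pos[which.min(d)]
    })
  },
  positions,
  target_mark_points,
  SIMPLIFY = FALSE  # keep one vector of picks per highway
)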
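The nearest-mark lookup can be tried on the first five positions of highway 1 from the question. The helper name `nearest_mark` and the target values are made up for this sketch.

```r
# Hypothetical helper: for each target, return the closest existing position
nearest_mark <- function(pos, targets) {
  vapply(targets, function(tgt) pos[which.min(abs(tgt - pos))], numeric(1))
}

# First five positions on highway 1, as shown in the question
pos <- c(1.4, 7.2, 15.5, 13.4, 19.7)

picks <- nearest_mark(pos, c(0, 10, 20))
picks  # 1.4 7.2 19.7

# Two nearby targets can map to the same mark, so duplicates are possible
dup_picks <- nearest_mark(pos, c(13, 13.5))
unique(dup_picks)  # 13.4
```

Because several targets can resolve to the same mark, the number of distinct marks actually picked may be smaller than the number of targets.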

Just to see that it works, you can visualise the marks.

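is_mark_point <- mapply(
  function(pos, mark)
  {
    # TRUE for positions that were picked as marks
    pos %in% mark
  },
  positions,
  actual_mark_points,
  SIMPLIFY = FALSE  # unsplit() below needs a list, one element per highway
)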

filtered$is.mark.point <- unsplit(is_mark_point, filtered$highway)

library(ggplot2)    
(p <- ggplot(filtered, aes(Position, highway, colour = is.mark.point)) +
  geom_point()
)
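The question also asks that no two picks be closer than 10, which the steps above do not check explicitly. One possible way to enforce it afterwards is a greedy pass over the sorted picks of a single highway. This is a sketch only: `enforce_min_gap` is a made-up helper and the input values are invented.

```r
# Hypothetical helper: walk the sorted picks and drop any pick that is
# closer than min_gap to the previously kept one
enforce_min_gap <- function(picked, min_gap = 10) {
  picked <- sort(unique(picked))
  if (length(picked) == 0) return(picked)
  keep <- picked[1]
  for (p in picked[-1]) {
    if (p - keep[length(keep)] >= min_gap) keep <- c(keep, p)
  }
  keep
}

thinned <- enforce_min_gap(c(1.4, 7.2, 19.7, 25, 31))
thinned                   # 1.4 19.7 31
all(diff(thinned) >= 10)  # TRUE
```

It could be applied per highway, for example with `lapply(actual_mark_points, enforce_min_gap)`. Dropping picks reduces the total, which is acceptable as long as the result stays around 30.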
Richie Cotton answered Nov 09 '22 13:11