How to subset vector based on string character?

Tags: string, r

I have a vector composed of entries such as "ZZZ1Z01Z0ZZ0", "1001ZZ0Z00Z0", and so on, and I want to subset this vector based on conditions such as:

  1. The third character is a Z
  2. The third AND seventh characters are Z
  3. The third AND seventh characters are Z, AND none of the other characters are Z

I tried playing around with strsplit and grep, but I couldn't figure out a way to restrict my conditions based on the position of a character in the string. Any suggestions?

Many thanks!

Rafael Maia asked Nov 23 '11 15:11


2 Answers

You can do this with regular expressions (see ?regexp for details on regular expressions).

grep returns the indices of the elements that match, or a zero-length vector if nothing matches. You may want to use grepl instead, since it returns a logical vector you can use to subset.

z <- c("ZZZ1Z01Z0ZZ0", "1001ZZ0Z00Z0")
# 3rd character is Z ("^" is start of string, "." is any character)
grep("^..Z", z)
# 3rd and 7th characters are Z
grep("^..Z...Z", z)
# 3rd and 7th characters are Z, no other characters are Z
# "[]" defines a "character class" and "^" in a character class negates the match
# "{n}" repeats the preceding match n times, "*" repeats it zero or more times
# "$" is end of string, so no Z can appear after the 7th character
grep("^[^Z]{2}Z[^Z]{3}Z[^Z]*$", z)
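
As a rough sketch of the grepl approach mentioned above, the logical vector can index z directly to return the matching strings rather than their positions. The last two entries of z and all variable names here are hypothetical, added only so that each condition has at least one match; the third pattern is anchored with "$" so that no Z may follow the seventh character.

```r
# The last two entries are hypothetical, added so each condition has a match
z <- c("ZZZ1Z01Z0ZZ0", "1001ZZ0Z00Z0", "10Z100Z00100", "10Z100Z001Z0")

# grepl() returns one TRUE/FALSE per element, so it can index z directly
third_is_z <- z[grepl("^..Z", z)]
third_and_seventh <- z[grepl("^..Z...Z", z)]

# "$" anchors the end of the string, so no Z can follow the 7th character
only_third_and_seventh <- z[grepl("^[^Z]{2}Z[^Z]{3}Z[^Z]*$", z)]

# grep(value = TRUE) is an equivalent way to get the matching strings
same_result <- grep("^..Z...Z", z, value = TRUE)
```

Either form should work for subsetting; grepl is probably the more convenient one when the condition has to be combined with others using & or |.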
Joshua Ulrich answered Sep 28 '22 08:09


Expanding on Josh's answer, you want:

your_dataset <- data.frame(
  z = c("ZZZ1Z01Z0ZZ0", "1001ZZ0Z00Z0")
)
regexes <- c("^..Z", "^..Z...Z", "^[^Z]{2}Z[^Z]{3}Z[^Z]*$")

lapply(regexes, function(rx)
{
  subset(your_dataset, grepl(rx, z))
})

Also consider replacing grepl(rx, z) with str_detect(z, rx), using the stringr package. (There's no real difference except for slightly more readable code.)
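
A minimal sketch of the str_detect variant, assuming the stringr package is installed. It uses a plain character vector instead of a data frame to keep things short; the last two entries of z and the pattern names are hypothetical, chosen only for illustration.

```r
library(stringr)  # not part of base R; install.packages("stringr") if needed

# The last two entries are hypothetical, added so every pattern has a match
z <- c("ZZZ1Z01Z0ZZ0", "1001ZZ0Z00Z0", "10Z100Z00100", "10Z100Z001Z0")

# Named patterns; the names are illustrative only
regexes <- c(
  third_is_z       = "^..Z",
  third_seventh_z  = "^..Z...Z",
  only_those_are_z = "^[^Z]{2}Z[^Z]{3}Z[^Z]*$"
)

# str_detect() takes the string first and the pattern second,
# the reverse of grepl(pattern, x)
matches <- lapply(regexes, function(rx) z[str_detect(z, rx)])
```

Because regexes is a named vector, the resulting list carries the same names, so a single result can be pulled out with, for example, matches$third_seventh_z.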

Richie Cotton answered Sep 28 '22 09:09