
Regex: grab from beginning to nth occurrence of character

Tags:

regex

r

I'm putting real time into learning regex and playing with different toy scenarios. One setup I can't get to work is grabbing from the beginning of a string to the nth occurrence of a character, where n > 1.

Here I can grab from the beginning of the string to the first underscore but I can't generalize this to the second or third underscore.

x <- c("a_b_c_d", "1_2_3_4", "<_?_._:")

gsub("_.*$", "", x)

Here's what I'm trying to achieve with regex (`sub`/`gsub`):

## > sapply(lapply(strsplit(x, "_"), "[", 1:2), paste, collapse="_")
## [1] "a_b" "1_2" "<_?"

#or

## > sapply(lapply(strsplit(x, "_"), "[", 1:3), paste, collapse="_")
## [1] "a_b_c" "1_2_3" "<_?_."

Related post: regex from first character to the end of the string

Tyler Rinker asked Apr 09 '13 18:04



1 Answer

Here's a start. To make it safe for general use, it would also need to properly escape any regex special characters passed in as the delimiter:

x <- c("a_b_c_d", "1_2_3_4", "<_?_._:", "", "abcd", "____abcd")

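## Build a pattern whose first group runs from the start of the string
## up to (but not including) the nth occurrence of `char`.
matchToNth <- function(char, n) {
    others <- paste0("[^", char, "]*") ## yields "[^_]*" if char is "_"
    mainPat <- paste0(c(rep(c(others, char), n-1), others), collapse="")
    paste0("(^", mainPat, ")", "(.*$)")
}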

gsub(matchToNth("_", 2), "\\1", x)
# [1] "a_b"  "1_2"  "<_?"  ""     "abcd" "_" 

gsub(matchToNth("_", 3), "\\1", x)
# [1] "a_b_c" "1_2_3" "<_?_." ""      "abcd"  "__"   
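
As for the escaping caveat, here is a minimal sketch of one way it might be handled. It is not part of the function above: `escapeChar` and `matchToNthSafe` are hypothetical names, and the sketch assumes a single-character delimiter and `perl = TRUE`, since in PCRE a backslash before a non-alphanumeric character makes it literal both inside and outside a character class.

```r
# Hypothetical helper: backslash-escape a single delimiter character
# unless it is alphanumeric (escaping a letter or digit could change
# its meaning, e.g. "\\d").
escapeChar <- function(char) {
    stopifnot(is.character(char), length(char) == 1L, nchar(char) == 1L)
    if (grepl("^[[:alnum:]]$", char)) char else paste0("\\", char)
}

# Same construction as matchToNth(), but built from the escaped delimiter.
# Assumes the pattern will be used with perl = TRUE.
matchToNthSafe <- function(char, n) {
    esc <- escapeChar(char)
    others <- paste0("[^", esc, "]*")
    mainPat <- paste0(c(rep(c(others, esc), n - 1), others), collapse = "")
    paste0("^(", mainPat, ").*$")
}

y <- c("a.b.c.d", "1.2.3.4", "abcd")
sub(matchToNthSafe(".", 2), "\\1", y, perl = TRUE)
# [1] "a.b"  "1.2"  "abcd"
```

Since the pattern is anchored at both ends it can match at most once per string, so `sub` is enough here. Multi-character delimiters would need a different construction, because a negated character class only excludes single characters.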
Josh O'Brien answered Nov 13 '22 13:11
