
Split on first/nth occurrence of delimiter

Tags:

regex

r

I am trying something I thought would be easy. I'm looking for a single regex solution (though others are welcome for completeness). I want to split a string on the nth occurrence of a delimiter.

Here is some data:

x <- "I like_to see_how_too"
pat <- "_"

Desired outcome

Say I want to split on first occurrence of _:

[1] "I like"  "to see_how_too"

Say I want to split on second occurrence of _:

[1] "I like_to see"   "how_too"

Ideally, the solution is a regex one-liner that generalizes to the nth occurrence and uses strsplit with a single regex.

Here's a solution that works but doesn't meet my requirement of a single regex used with strsplit:

x <- "I like_to see_how_too"
y <- "_"
n <- 1
loc <- gregexpr("_", x)[[1]][n]

c(substr(x, 1, loc-1), substr(x, loc + 1, nchar(x)))
Tyler Rinker asked Oct 10 '14 14:10


4 Answers

Here is another solution using the gsubfn package and some regex-fu. To split on a different occurrence of the delimiter, simply change the number inside the range quantifier, {n}.

library(gsubfn)
x <- 'I like_to see_how_too'
strapply(x, '((?:[^_]*_){1})(.*)', c, simplify = ~ sub('_$', '', x))
# [1] "I like"  "to see_how_too"

If you would like the nth occurrence to be user defined, you could use the following:

n <- 2
re <- paste0('((?:[^_]*_){',n,'})(.*)')
strapply(x, re, c, simplify = ~ sub('_$', '', x))
# [1] "I like_to see" "how_too" 
hwnd answered Oct 25 '22 07:10


Non-Solution

Since R supports PCRE (with perl=TRUE), you can use \K to drop everything matched before \K from the reported match.

Below is the regex to split the string at the 3rd _:

^[^_]*(?:_[^_]*){2}\K_

If you want to split at the nth occurrence of _, just change 2 to (n - 1).

Demo on regex101

That was the plan. However, strsplit seems to think differently.

Actual execution

Demo on ideone.com

x <- "I like_to see_how_too but_it_seems to_be_impossible"
strsplit(x, "^[^_]*(?:_[^_]*)\\K_", perl=TRUE)
strsplit(x,  "^[^_]*(?:_[^_]*){1}\\K_", perl=TRUE)
strsplit(x,  "^[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)

# strsplit(x, "^[^_]*(?:_[^_]*)\\K_", perl=TRUE)
# [[1]]
# [1] "I like_to see" "how_too but"   "it_seems to"   "be_impossible"

# strsplit(x,  "^[^_]*(?:_[^_]*){1}\\K_", perl=TRUE)
# [[1]]
# [1] "I like_to see" "how_too but"   "it_seems to"   "be_impossible"

# strsplit(x,  "^[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# [[1]]
# [1] "I like"     "to see"     "how"        "too but"    "it"        
# [6] "seems to"   "be"         "impossible" 

It still fails with the stronger anchor \A:

strsplit(x,  "\\A[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# [[1]]
# [1] "I like"     "to see"     "how"        "too but"    "it"        
# [6] "seems to"   "be"         "impossible"

Explanation?

This behavior suggests that strsplit finds the first match, takes a substring to extract the first token and the remainder, and then looks for the next match in the remainder only.

This discards all state from previous matches, so the regex is matched against the remainder as if it were a fresh string, and ^ or \A matches again at the start of the remainder. As a result, it is impossible to make strsplit stop after the first match with this pattern. There is not even a parameter in strsplit to limit the number of splits.
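If a single strsplit call is not a hard requirement, the same \K pattern can still be useful. The sketch below is only an illustration, not part of the original test: it matches once with regexpr, which reports the position of the underscore after \K, and then cuts the string with substr. The variable names n, re, loc and parts are my own, and it assumes n is at least 1 and that the string contains at least n underscores.

```r
x <- "I like_to see_how_too"
n <- 2  # which underscore to split on (assumed: 1 <= n <= number of underscores)

# Build the \K pattern from above for the nth underscore
re <- sprintf("^[^_]*(?:_[^_]*){%d}\\K_", n - 1)

# regexpr matches only once, so the anchored pattern cannot restart on a remainder
loc <- regexpr(re, x, perl = TRUE)

# Cut around the matched delimiter
parts <- c(substr(x, 1, loc - 1),
           substr(x, loc + attr(loc, "match.length"), nchar(x)))
parts
# [1] "I like_to see" "how_too"
```

Because the match happens exactly once, the problem of strsplit re-anchoring on the remainder does not arise here.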

nhahtdh answered Oct 25 '22 08:10


Rather than splitting, you can match the string and capture the two parts.

Try this regex:

^((?:[^_]*_){1}[^_]*)_(.*)$

Replace 1 with n-1 to split on the nth occurrence of the underscore.
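In R, this matching approach could look roughly like the following. This is a minimal sketch rather than a tested recipe: it uses base R's regexec and regmatches to pull out the two capture groups, assumes R 3.3.0 or later (where regexec accepts perl = TRUE), and the names n, re, m and parts are only illustrative.

```r
x <- "I like_to see_how_too"
n <- 2  # split on the nth underscore (assumed n >= 1)

# Same regex as above, with the quantifier filled in as n - 1
re <- sprintf("^((?:[^_]*_){%d}[^_]*)_(.*)$", n - 1)

# regexec returns the full match plus one entry per capture group
m <- regmatches(x, regexec(re, x, perl = TRUE))[[1]]

# Drop the full match and keep the two captured pieces
parts <- m[-1]
parts
# [1] "I like_to see" "how_too"
```

If the string has fewer than n underscores, the regex does not match and parts comes back empty, so a real helper would probably want to check for that case.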

RegEx Demo

Update: It seems R also supports PCRE and in that case you can do split as well using this PCRE regex:

^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_

Replace 1 with n-1 to split on the nth occurrence of the underscore.

  • (*FAIL) behaves like a failing negative assertion and is a synonym for (?!)
  • (*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
  • Together, (*SKIP)(*FAIL) work around the restriction that PCRE does not allow a variable-length lookbehind, which the regex above would otherwise need.

RegEx Demo2

x <- "I like_to see_how_too"

strsplit(x,  "^((?:[^_]*_){0}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
strsplit(x,  "^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_", perl=TRUE)

## > strsplit(x,  "^((?:[^_]*_){0}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## [[1]]
## [1] "I like" "to see" "how"    "too"   

## > strsplit(x,  "^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## [[1]]
## [1] "I like_to see" "how_too" 
anubhava answered Oct 25 '22 08:10


This uses gsubfn to preprocess the input string so that strsplit can handle it. The main advantage is that one can specify a vector of numbers, k, indicating which underscores to split on.

It replaces the occurrences of underscore defined by k by a double underscore and then splits on double underscore. In this example we split at the 2nd and 4th underscore:

library(gsubfn)

k <- c(2, 4) # split at 2nd and 4th _

p <- proto(fun = function(., x) if (count %in% k) "__" else "_")
strsplit(gsubfn("_", p, "aa_bb_cc_dd_ee_ff"), "__")

giving:

[[1]]
[1] "aa_bb" "cc_dd" "ee_ff"

If empty fields are allowed then use any other character sequence not in the string, e.g. "\01" in place of the double underscore.
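For comparison, the same mark-then-split idea can be sketched in base R without gsubfn. This is not the code from this answer, only an illustration of the technique: it finds every underscore with gregexpr, swaps the ones listed in k for a sentinel through the replacement form of regmatches, and splits on the sentinel. It assumes "\001" does not occur in the input and that every value in k is no larger than the number of underscores; the names s, m, repl and pieces are my own.

```r
s <- "aa_bb_cc_dd_ee_ff"
k <- c(2, 4)      # split at the 2nd and 4th underscore
sep <- "\001"     # sentinel assumed not to occur in s

# Locate every underscore in the string
m <- gregexpr("_", s)

# Keep each underscore as is, except the kth ones, which become the sentinel
repl <- rep("_", length(m[[1]]))
repl[k] <- sep
regmatches(s, m) <- list(repl)

# Split on the sentinel only
pieces <- strsplit(s, sep, fixed = TRUE)[[1]]
pieces
# [1] "aa_bb" "cc_dd" "ee_ff"
```

Using a sentinel instead of a double underscore keeps the result correct even when the input has empty fields between underscores.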

See section 4 of the gsubfn vignette for more info on using gsubfn with proto objects to retain state between matches.

G. Grothendieck answered Oct 25 '22 07:10