
R: regex first occurrence based on condition

Tags:

regex

r

I am trying to split strings at the first whitespace that comes after at least 3 characters. Here is my code:

string <- c("Le jour la nuit", "Les jours les nuits")
part1 <- sub("(\\S{3,})\\s?(.*)", "\\1", string)
part2 <- sub("(\\S{3,})\\s?(.*)", "\\2", string)

# output
> part1
[1] "Le jour" "Les"    
> part2
[1] "Le la nuit"      "jours les nuits"

For the first part, it works exactly as desired. However, it is not the case for the second part: part2[1] should be la nuit instead of Le la nuit.

I am not sure how to achieve this and would be thankful for some guidance.

niko asked Nov 21 '25 04:11


1 Answer

I am not sure what you really want, but per your requirements you could anchor the pattern at the start of the string and capture the first part lazily:

^(.{3,}?)(?:(?<!,)\\s)+(.*)

This says:

^              # start of the string
(.{3,}?)       # capture 3+ characters lazily, up to...
(?:(?<!,)\\s)+ # 1+ whitespace characters not preceded by a comma
(.*)           # capture the rest of the string

In R:

string <- c("Le jour la nuit", "Les jours les nuits", "les, jours les nuits")
(part1 <- sub("^(.{3,}?)(?:(?<!,)\\s)+(.*)", "\\1", string, perl = TRUE))
(part2 <- sub("^(.{3,}?)(?:(?<!,)\\s)+(.*)", "\\2", string, perl = TRUE))

Yielding

[1] "Le jour"    "Les"        "les, jours"

and

[1] "la nuit"         "jours les nuits" "les nuits"      
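
As a side note, here is a small sketch of why the unanchored pattern from the question leaves Le in part2. sub() only replaces the matched text, and without ^ the match for the first string starts at jour, so the leading "Le " is never touched. The names pattern_q and m are just chosen for illustration.

```r
string <- c("Le jour la nuit", "Les jours les nuits")
pattern_q <- "(\\S{3,})\\s?(.*)"   # pattern from the question, no ^ anchor

# Locate the first match in each string
m <- regexpr(pattern_q, string)
as.integer(m)            # start positions: 4 1
regmatches(string, m)    # the text that sub() actually replaces

# Everything before the match start survives the replacement
substr(string, 1, as.integer(m) - 1)
```

For the first string the surviving prefix is "Le ", which is why "\\2" produces "Le la nuit" there. Anchoring with ^ forces the match to cover the string from its first character.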


If you need a data frame as the result, you could define a small helper function (using sapply, regexec and regmatches):

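make_df <- function(text) {
  parts <- sapply(text, function(x) {
    # groups[[1]] holds the full match followed by the two capture groups
    m <- regexec("^(.{3,}?)(?:(?<!,)\\s)+(.*)", x, perl = TRUE)
    groups <- regmatches(x, m)
    # indexing past the end of an empty match yields NA for non-matching strings
    c(groups[[1]][2], groups[[1]][3])
  }, USE.NAMES = FALSE)
  # sapply returns one column per input string, so transpose to get one row each
  setNames(as.data.frame(t(parts), stringsAsFactors = FALSE), c("part1", "part2"))
}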

(df <- make_df(string))

For string <- c("Le jour la nuit", "Les jours les nuits", "les, jours les nuits", "somejunk"), this yields:

       part1           part2
1    Le jour         la nuit
2        Les jours les nuits
3 les, jours       les nuits
4       <NA>            <NA>
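
If you are on a reasonably recent R (3.4.0 or later, as far as I know), utils::strcapture() may be a shorter route to a similar data frame. This is only a sketch of that alternative, not a replacement for the function above; proto and df2 are names chosen for illustration.

```r
string <- c("Le jour la nuit", "Les jours les nuits",
            "les, jours les nuits", "somejunk")

# proto supplies the column names and types for the two capture groups
proto <- data.frame(part1 = character(), part2 = character(),
                    stringsAsFactors = FALSE)

# perl = TRUE is needed because the pattern uses a lookbehind
df2 <- utils::strcapture("^(.{3,}?)(?:(?<!,)\\s)+(.*)", string, proto,
                         perl = TRUE)
df2
```

Strings that do not match the pattern, such as "somejunk", should come back as NA in both columns, just as with make_df.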
Jan answered Nov 22 '25 18:11


