
extract character preceding first dot in a string

Tags: string, regex, r

I would like to extract the character preceding the first dot in each string of a column. I can do so with the code below, but it seems overly complex and I had to resort to a for-loop. Is there an easier way? I am particularly interested in a regex solution.

Note that finding the last number in each string will not work with my real data, although that approach would work with this example.

Thank you for any advice.

my.data <- read.table(text = '
     my.string  state
     .........    A
     1........    B
     112......    C
     11111....    D
     1111113..    E
     111111111    F
     111111111    G
', header = TRUE, stringsAsFactors = FALSE)

desired.result <- c(NA,1,2,1,3,NA,NA)

Identify the position of the first dot:

my.data$first.dot <- apply(my.data, 1, function(x) {     
                                as.numeric(gregexpr("\\.", x['my.string'])[[1]])[1]
                          })

Split strings:

split.strings <- t(apply(my.data, 1, function(x) { (strsplit(x['my.string'], '')[[1]]) } ))

my.data$revised.first.dot <- ifelse(my.data$first.dot < 2, NA, my.data$first.dot-1)

Extract the character preceding the first dot:

for(i in 1:nrow(my.data)) {
     my.data$character.before.dot[i] <- split.strings[i,my.data$revised.first.dot[i]]
}

my.data

#   my.string state first.dot revised.first.dot character.before.dot
# 1 .........     A         1                NA                 <NA>
# 2 1........     B         2                 1                    1
# 3 112......     C         4                 3                    2
# 4 11111....     D         6                 5                    1
# 5 1111113..     E         8                 7                    3
# 6 111111111     F        -1                NA                 <NA>
# 7 111111111     G        -1                NA                 <NA>

Here is a related post:

find location of character in string

Mark Miller asked Dec 06 '22 23:12


2 Answers

Use the regex below, and don't forget to set the perl=TRUE parameter.

^[^.]*?\K[^.](?=\.)

In R, the backslashes must be escaped inside the string, so the regex becomes:

^[^.]*?\\K[^.](?=\\.)

DEMO

> library(stringr)
> as.numeric(str_extract(my.data$my.string, perl("^[^.]*?\\K[^.](?=\\.)")))
[1] NA  1  2  1  3 NA NA

Pattern Explanation:

  • ^ Asserts that we are at the start of the string.
  • [^.]*? Non-greedy match of zero or more characters that are not a dot.
  • \K Discards the previously matched characters from the reported match.
  • [^.] The character we want to match; it must not be a dot.
  • (?=\.) This character must be followed by a dot. So the pattern matches the character that sits just before the first dot.
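
The pattern explained above can also be tried in base R, without stringr. This is only a minimal sketch, assuming the strings from the question; the names x, pattern and char.before.dot are made up for illustration. regexpr() returns -1 where nothing matches, so the result vector is pre-filled with NA and only the matched positions are overwritten.

```r
# Example strings copied from the question's my.string column
x <- c(".........", "1........", "112......", "11111....",
       "1111113..", "111111111", "111111111")

# Same pattern as above; \K needs the PCRE engine, hence perl = TRUE
pattern <- "^[^.]*?\\K[^.](?=\\.)"
m <- regexpr(pattern, x, perl = TRUE)

# Start with NA everywhere, then fill in only the strings that matched
char.before.dot <- rep(NA_character_, length(x))
char.before.dot[m != -1] <- regmatches(x, m)

as.numeric(char.before.dot)
# [1] NA  1  2  1  3 NA NA
```

Pre-filling with NA keeps the result aligned with the input rows, which regmatches() alone does not do because it drops the strings that have no match.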

Avinash Raj answered Dec 19 '22 11:12


The simplest regex would be ^([^.])+(?=\.):

^      # Start of string
(      # Start of group 1
 [^.]  # Match any character except .
)+     # Repeat as many times as needed, overwriting the previous match
(?=\.) # Assert the next character is a .

Test it live on regex101.com.

The contents of group 1 will be your desired character. I'm not much of an R guy, but according to RegexBuddy, the following should work:

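# Run the regex on the string column, not on the whole data frame
matches <- regexpr("^([^.])+(?=\\.)", my.data$my.string, perl = TRUE)
# Build match data that points at capture group 1 instead of the full match
result <- attr(matches, "capture.start")[, 1]
attr(result, "match.length") <- attr(matches, "capture.length")[, 1]
# Returns only the matched characters; strings without a match are dropped
regmatches(my.data$my.string, result)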
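
If the result has to line up with the rows of the data (NA where there is no character before a first dot), one option is to read the capture positions directly and use substr(). This is a rough sketch under the same assumptions as the code above; x, start, len and char.before.dot are illustrative names, not part of the original answer.

```r
# Example strings copied from the question's my.string column
x <- c(".........", "1........", "112......", "11111....",
       "1111113..", "111111111", "111111111")

m <- regexpr("^([^.])+(?=\\.)", x, perl = TRUE)

# Position and length of capture group 1 (both are -1 when there is no match)
start <- attr(m, "capture.start")[, 1]
len   <- attr(m, "capture.length")[, 1]

# Keep one entry per input string, with NA for the non-matching ones
char.before.dot <- ifelse(start > 0,
                          substr(x, start, start + len - 1),
                          NA_character_)
as.numeric(char.before.dot)
```

Because group 1 is repeated, it holds only its last iteration, which is the character right before the first dot.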

Tim Pietzcker answered Dec 19 '22 10:12