I would like to extract the character preceding the first dot in a column of strings. I can do so with the code below, but it seems overly complex and I had to resort to a for-loop. Is there an easier way? I am particularly interested in a regex solution.
Note that finding the last number in each string will not work with my real data, although that approach would work with this example.
Thank you for any advice.
my.data <- read.table(text = '
my.string state
......... A
1........ B
112...... C
11111.... D
1111113.. E
111111111 F
111111111 G
', header = TRUE, stringsAsFactors = FALSE)
desired.result <- c(NA,1,2,1,3,NA,NA)
Identify the position of the first dot:
my.data$first.dot <- apply(my.data, 1, function(x) {
as.numeric(gregexpr("\\.", x['my.string'])[[1]])[1]
})
Split strings:
split.strings <- t(apply(my.data, 1, function(x) { (strsplit(x['my.string'], '')[[1]]) } ))
my.data$revised.first.dot <- ifelse(my.data$first.dot < 2, NA, my.data$first.dot-1)
Extract the character preceding the first dot:
for(i in 1:nrow(my.data)) {
my.data$character.before.dot[i] <- split.strings[i,my.data$revised.first.dot[i]]
}
my.data
# my.string state first.dot revised.first.dot character.before.dot
# 1 ......... A 1 NA <NA>
# 2 1........ B 2 1 1
# 3 112...... C 4 3 2
# 4 11111.... D 6 5 1
# 5 1111113.. E 8 7 3
# 6 111111111 F -1 NA <NA>
# 7 111111111 G -1 NA <NA>
Here is a related post:
find location of character in string
Use the regex below, and don't forget to set the perl=TRUE parameter.
^[^.]*?\K[^.](?=\.)
In R, the backslashes must be doubled inside a string literal, so the regex becomes:
^[^.]*?\\K[^.](?=\\.)
> library(stringr)
> as.numeric(str_extract(my.data$my.string, perl("^[^.]*?\\K[^.](?=\\.)")))
[1] NA 1 2 1 3 NA NA
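The same pattern can also be applied with base R only. This is a minimal sketch, assuming the strings from the question are held in a plain character vector (named x here for illustration); it may be useful because, as far as I know, newer stringr versions are built on the ICU engine, which does not support \K.

```r
# Example strings from the question (x is an illustrative name)
x <- c(".........", "1........", "112......", "11111....",
       "1111113..", "111111111", "111111111")

# perl = TRUE is required for \K and the lookahead
m <- regexpr("^[^.]*?\\K[^.](?=\\.)", x, perl = TRUE)

# regmatches() returns only the matches, so fill a full-length NA vector
res <- rep(NA_character_, length(x))
res[m > 0] <- regmatches(x, m)

result <- as.numeric(res)
result
```

Filling a pre-built NA vector keeps the result aligned with the rows of the data frame, so it can be assigned straight to a new column.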
Pattern Explanation:
^        # Assert that we are at the start of the string
[^.]*?   # Non-greedy match of zero or more non-dot characters
\K       # Discard the characters matched so far
[^.]     # The character we want to match must not be a dot
(?=\.)   # And it must be followed by a dot
So the pattern matches the character just before the first dot.
The simplest regex would be ^([^.])+(?=\.):
^ # Start of string
( # Start of group 1
[^.] # Match any character except .
)+ # Repeat as many times as needed, overwriting the previous match
(?=\.) # Assert the next character is a .
Test it live on regex101.com.
The contents of group 1 will be your desired character. I'm not much of an R guy, but according to RegexBuddy, the following should work:
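# Pass the character column, not the whole data frame
matches <- regexpr("^([^.])+(?=\\.)", my.data$my.string, perl=TRUE)
# Build a match object from the position and length of capture group 1
result <- attr(matches, "capture.start")[,1]
attr(result, "match.length") <- attr(matches, "capture.length")[,1]
# Returns only the strings that matched; non-matching rows are dropped
regmatches(my.data$my.string, result)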
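Because regmatches() returns only the strings that matched, getting a full-length result with NA for the other rows takes one more step. The following is a rough sketch of one way to do that with the same capture-group regex, again assuming the strings are in a character vector (x is an illustrative name, not from the original code):

```r
# Example strings from the question (x is an illustrative name)
x <- c(".........", "1........", "112......", "11111....",
       "1111113..", "111111111", "111111111")

m <- regexpr("^([^.])+(?=\\.)", x, perl = TRUE)

# Start position of the last repetition of group 1; -1 where there is no match
start <- attr(m, "capture.start")[, 1]

# Take the single captured character, or NA when nothing matched
res <- ifelse(start > 0, substr(x, start, start), NA_character_)

result <- as.numeric(res)
result
```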