
Regular expression that returns numbers following a specific letter until the next letter

I need a regular expression that matches a specific letter and the (one or two) digits that follow it, up to the next letter. For example, I would like to extract how many carbons (C) are in a formula using regular expressions in R:

strings <- c("C16H4ClNO2", "CH8O", "F2Ni")

I need an expression that returns the number of C atoms, which can be one or two digits, and that does not return the number after chlorine (Cl).

substr(strings,regexpr("C[0-9]+",strings) + 1, regexpr("[ABDEFGHIJKLMNOPQRSTUVWXYZ]+",strings) -1)
[1] "16" "C"  ""  

but the answer I want is

"16","1","0"

Moreover, I would like the regular expression to locate the next letter automatically and stop before it, rather than my having to specify the end position as any letter other than C.

andrea mizzi asked Dec 23 '22 20:12



1 Answer

makeup in the CHNOSZ package will parse a chemical formula. Here are some alternatives that use it:

1) Create a list L of such fully parsed formulas and then for each one check if it has a "C" component and return its value or 0 if none:

library(CHNOSZ)

L <- Map(makeup, strings)
sapply(L, function(x) if ("C" %in% names(x)) x[["C"]] else 0)
## C16H4ClNO2       CH8O       F2Ni 
##         16          1          0 

Note that L is a list of the fully parsed formulas in case you have other requirements:

> L
$C16H4ClNO2
 C  H Cl  N  O 
16  4  1  1  2 

$CH8O
C H O 
1 8 1 

$F2Ni
 F Ni 
 2  1 

1a) By appending c(C = 0) to each list component we can avoid testing for the existence of carbon, which yields the following shorter version of the sapply line in (1):

sapply(lapply(L, c, c(C = 0)), "[[", "C")
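
The reason this works can be seen without CHNOSZ. The sketch below uses hand-written named vectors (the names with_c and without_c are made up for illustration) that mimic the makeup output shown above. On a named vector, "[[" returns the first element with the requested name, so the appended zero is only reached when carbon is absent:

```r
# Hand-written stand-ins for makeup() results; values copied from the
# C16H4ClNO2 and F2Ni entries of L shown above
with_c    <- c(C = 16, H = 4, Cl = 1, N = 1, O = 2)
without_c <- c(F = 2, Ni = 1)

# Appending C = 0 creates a second "C" entry; "[[" picks the first one
c(with_c, c(C = 0))[["C"]]
## [1] 16

# With no carbon present, the appended entry is the only "C"
c(without_c, c(C = 0))[["C"]]
## [1] 0
```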

2) This one-line variation of (1) gives the same answer as in (1) except for names. It appends "C0" to each formula to avoid having to test for the existence of carbon:

sapply(lapply(paste0(strings, "C0"), makeup), "[[", "C")
## [1] 16  1  0

2a) Here is a variation of (2) that eliminates the lapply by using the fact that makeup will accept a matrix:

sapply(makeup(as.matrix(paste0(strings, "C0"))), "[[", "C")
## [1] 16  1  0
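
For comparison, a regex-only approach in base R, closer to what the question asked for, might look like the sketch below. It is not one of the makeup-based alternatives above. The helper name count_carbon is hypothetical, and the sketch assumes that every element symbol is one uppercase letter optionally followed by lowercase letters and that carbon appears at most once per formula (for something like CH3COOH only the first C is counted). The negative lookahead (?![a-z]) skips symbols such as Cl, and the capture group stops at the next non-digit, so no end position has to be specified:

```r
strings <- c("C16H4ClNO2", "CH8O", "F2Ni")

# Hypothetical helper: count carbon atoms with a regex instead of a parser
count_carbon <- function(x) {
  # "C" not followed by a lowercase letter, then capture any digits after it
  m <- regmatches(x, regexec("C(?![a-z])([0-9]*)", x, perl = TRUE))
  vapply(m, function(hit) {
    if (length(hit) == 0) return(0)   # no carbon symbol in the formula
    if (hit[2] == "") return(1)       # bare "C" means a single atom
    as.numeric(hit[2])                # digits following "C"
  }, numeric(1))
}

count_carbon(strings)
## [1] 16  1  0
```

A parser such as makeup remains the safer choice for formulas with repeated elements or parentheses.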
G. Grothendieck answered Jan 31 '23 09:01
