I need a regular expression that returns the digits (one or two) that follow a specific letter, up to the next letter. For example, I would like to extract how many carbons (C) are in a formula using regular expressions in R:
strings <- c("C16H4ClNO2", "CH8O", "F2Ni")
I need an expression that returns the number of C atoms, which can be one or two digits, and that does not return the number after chlorine (Cl). My attempt so far:
substr(strings,regexpr("C[0-9]+",strings) + 1, regexpr("[ABDEFGHIJKLMNOPQRSTUVWXYZ]+",strings) -1)
[1] "16" "C" ""
but the answer I want returned is
"16","1","0"
Moreover, I would like the regular expression to locate the next letter automatically and stop before it, rather than relying on an end position that I specify as any letter other than C.
The makeup function in the CHNOSZ package will parse a chemical formula. Here are some alternatives that use it:
1) Create a list L of such fully parsed formulas and then, for each one, check whether it has a "C" component and return its value, or 0 if there is none:
library(CHNOSZ)
L <- Map(makeup, strings)
sapply(L, function(x) if ("C" %in% names(x)) x[["C"]] else 0)
## C16H4ClNO2 CH8O F2Ni
## 16 1 0
Note that L is a list of the fully parsed formulas, in case you have other requirements:
> L
$C16H4ClNO2
C H Cl N O
16 4 1 1 2
$CH8O
C H O
1 8 1
$F2Ni
F Ni
2 1
1a) By appending c(C = 0) to each list component we can avoid having to test for the existence of carbon, which yields the following shorter version of the sapply line in (1):
sapply(lapply(L, c, c(C = 0)), "[[", "C")
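To see why appending C = 0 is safe, here is a small sketch that does not need CHNOSZ. L_demo is a hand-written stand-in for two of the parsed formulas shown above, not real makeup output. Extracting by name with "[[" returns the first element with that name, so the appended zero is only reached when the formula has no carbon of its own:

```r
# Hand-written stand-ins for two of the parsed formulas printed above
# (an assumption for illustration; in practice these come from makeup()).
L_demo <- list(
  C16H4ClNO2 = c(C = 16, H = 4, Cl = 1, N = 1, O = 2),
  F2Ni       = c(F = 2, Ni = 1)
)

# Append a fallback C = 0 to the end of every component.
padded <- lapply(L_demo, c, c(C = 0))

# The first formula now has two elements named "C": the real count, then the fallback.
names(padded$C16H4ClNO2)

# "[[" with a name picks the first match, so the real count wins when present.
carbons <- sapply(padded, "[[", "C")
carbons
## C16H4ClNO2       F2Ni
##         16          0
```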
2) This one-line variation of (1) gives the same answer as (1), except for the names. It appends "C0" to each formula to avoid having to test for the existence of carbon:
sapply(lapply(paste0(strings, "C0"), makeup), "[[", "C")
## [1] 16 1 0
2a) Here is a variation of (2) that eliminates the lapply by using the fact that makeup will accept a matrix:
sapply(makeup(as.matrix(paste0(strings, "C0"))), "[[", "C")
## [1] 16 1 0
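If installing CHNOSZ is not an option, the regex approach from the question can also be made to work in base R. The following is only a sketch under stated assumptions: count_element is a hypothetical helper (it is not part of CHNOSZ), and it assumes plain formulas in which every element symbol is one uppercase letter optionally followed by one lowercase letter, with no parentheses, charges, or hydrates. A negative lookahead keeps "C" from matching inside "Cl", and the digits are captured up to whatever character comes next, so no end position has to be specified:

```r
strings <- c("C16H4ClNO2", "CH8O", "F2Ni")

# Hypothetical helper, not from any package: count atoms of one element.
count_element <- function(formulas, element = "C") {
  # The symbol must not be followed by a lowercase letter (so "C" skips "Cl");
  # any digits that follow are part of the match.
  pattern <- paste0(element, "(?![a-z])[0-9]*")
  vapply(formulas, function(f) {
    m <- regmatches(f, gregexpr(pattern, f, perl = TRUE))[[1]]
    if (length(m) == 0) return(0)             # element absent
    digits <- sub(element, "", m, fixed = TRUE)
    # A symbol with no digits counts as 1; repeated occurrences are summed.
    sum(ifelse(digits == "", 1, as.numeric(digits)))
  }, numeric(1), USE.NAMES = FALSE)
}

count_element(strings)
## [1] 16  1  0
```

The same helper should handle two-letter symbols as well, for example count_element(strings, "Cl"), but for anything beyond simple formulas a real parser such as makeup is the safer choice.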