I have the following text:
2 HCl + 12 Na + 3 (Na₃Cl₂)₂₄ → 2 NaCl + H₂
I would like to match each molecule, including its coefficient. The regex below is almost working, but the space character, right before the last match, is getting matched, which it shouldn't. Here's the regex I'm using:
(([0-9]* ??\(*([a-z]+[₀-₉]*)+\)*[₀-₉]*))
If you look at this regex101 link, it might be easier to see what my problem is: https://regex101.com/r/hK7jY6/1
If your strings are always valid chemical formulae, why bother spelling out the subscripts, digits, and letters? They are all non-whitespace symbols. Since each molecule must contain a letter or a (, put those in the character class [a-z(], and then append \S* (zero or more non-whitespace characters):
/(?:\d+ )?[a-z(]\S*/gi
See the regex demo. The (?:...)? construct is an optional non-capturing group, i.e. a group that is only used to group and does not capture (store the submatch in a memory buffer).
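To make the behaviour concrete, here is a minimal Python sketch of that pattern applied to the equation from the question. The names EQUATION, TOKEN, and matches are only illustrative, and the /gi flags are approximated with re.IGNORECASE plus findall.

```python
import re

# Equation text copied from the question.
EQUATION = "2 HCl + 12 Na + 3 (Na₃Cl₂)₂₄ → 2 NaCl + H₂"

# Optional "digits + space" coefficient, then a letter or "(",
# then the rest of the non-whitespace run.
TOKEN = re.compile(r"(?:\d+ )?[a-z(]\S*", re.IGNORECASE)

# The only group is non-capturing, so findall returns whole matches.
matches = TOKEN.findall(EQUATION)
print(matches)
```

The "+" and "→" separators are skipped because a match can only start at a digit run followed by a space, a letter, or an opening parenthesis.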
Your pattern makes the digits and the space at the beginning two separately optional subpatterns. Instead, match them obligatorily, but place them together inside an optional group:
(?:[0-9]+ )?\(*([a-z]+[₀-₉]*)+\)*[₀-₉]*
See regex demo
Your [0-9]* ?? is turned into (?:[0-9]+ )?. Note that you do not have to use the lazy version of the ? quantifier here; it works the same way as the greedy one. I also removed the two unnecessary outer groups (...).
Since the (?:[0-9]+ )? group is optional, the space will be matched only if there is a digit in front of it. If there is no digit, the next thing that can be matched is zero or more ( characters. Then, a [a-z] letter must be present (if there is no (, the letter will be the first character in the match).
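The difference between the two patterns can be checked side by side. The following is a small Python sketch, not the only way to test this; the helper tokens and the variable names are made up for illustration, and the case-insensitive flag is assumed from the regex101 setup.

```python
import re

EQUATION = "2 HCl + 12 Na + 3 (Na₃Cl₂)₂₄ → 2 NaCl + H₂"

# Pattern from the question: digits and space are optional independently.
ORIGINAL = re.compile(r"(([0-9]* ??\(*([a-z]+[₀-₉]*)+\)*[₀-₉]*))", re.IGNORECASE)

# Fixed pattern: the space is only allowed right after at least one digit.
FIXED = re.compile(r"(?:[0-9]+ )?\(*([a-z]+[₀-₉]*)+\)*[₀-₉]*", re.IGNORECASE)


def tokens(pattern, text):
    """Return the full text of every match, ignoring capture groups."""
    return [m.group(0) for m in pattern.finditer(text)]


original_tokens = tokens(ORIGINAL, EQUATION)
fixed_tokens = tokens(FIXED, EQUATION)

print(original_tokens[-1] == " H₂")  # leading space is part of the match
print(fixed_tokens[-1] == "H₂")      # leading space is gone
```

finditer with group(0) is used instead of findall because both patterns contain capture groups, and findall would return the groups rather than the whole match.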
Let me break it down:
(?:[0-9]+ )? - optional one or more digits followed by a space
\(* - zero or more ( (maybe you meant ?, i.e. at most one)
([a-z]+[₀-₉]*)+ - one or more sequences of one or more letters followed by zero or more subscript digits
\)* - zero or more ) (maybe you meant ?, i.e. at most one)
[₀-₉]* - zero or more subscript digits
If you also want to make sure you do not match (Ca or H), you should also split the \(*...\)* like this:
(?:[0-9]+ )?(?:(?:[a-z]+[₀-₉]*)+|\((?:[a-z]+[₀-₉]*)+\))[₀-₉]*
See another demo
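As a rough check of the stricter pattern, the Python sketch below runs it on a balanced group and on the two unbalanced inputs mentioned above. STRICT and tokens are illustrative names, and re.IGNORECASE is an assumption carried over from the demo flags.

```python
import re

# Either a bare run of element symbols, or the same run wrapped in one
# matching pair of parentheses, followed by optional subscript digits.
STRICT = re.compile(
    r"(?:[0-9]+ )?(?:(?:[a-z]+[₀-₉]*)+|\((?:[a-z]+[₀-₉]*)+\))[₀-₉]*",
    re.IGNORECASE,
)


def tokens(text):
    """All groups are non-capturing, so findall returns whole matches."""
    return STRICT.findall(text)


balanced = tokens("3 (Na₃Cl₂)₂₄")
unopened = tokens("H)")
unclosed = tokens("(Ca")
print(balanced, unopened, unclosed)
```

Note that the letters of an unbalanced input are still matched; only the stray parenthesis is left out of the match.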
While Wiktor's answer is very informative, I think I might have found an easier way of doing this.
([0-9]+ )*[a-z\(₀-₉\)]+
This will match all the parts of the equation as far as I can tell.
Demo
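A quick Python sketch of this shorter pattern, with illustrative names and an assumed case-insensitive flag, shows that it returns the same tokens for the equation in the question. Because it is a single character class, it also accepts any mix of those characters, which may or may not matter for your input.

```python
import re

EQUATION = "2 HCl + 12 Na + 3 (Na₃Cl₂)₂₄ → 2 NaCl + H₂"

# Optional coefficients, then any run of letters, parentheses,
# and subscript digits in any order.
SIMPLE = re.compile(r"([0-9]+ )*[a-z\(₀-₉\)]+", re.IGNORECASE)


def tokens(text):
    """Return full matches; group(0) sidesteps the capture group."""
    return [m.group(0) for m in SIMPLE.finditer(text)]


equation_tokens = tokens(EQUATION)
malformed_tokens = tokens(")(₂")  # not a formula, but still one match
print(equation_tokens)
print(malformed_tokens)
```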
Update
Please see Wiktor's updated answer; it's better than this.