Why does this regex match a space in the last match?

Tags:

regex

I have the following text:

2 HCl + 12 Na + 3 (Na₃Cl₂)₂₄ → 2 NaCl + H₂

I would like to match each molecule, including its coefficient. The regex below almost works, but the space right before the last match gets matched too, which it shouldn't. Here's the regex I'm using:

(([0-9]* ??\(*([a-z]+[₀-₉]*)+\)*[₀-₉]*))

If you look at this regex101 link, it might be easier to see what my problem is: https://regex101.com/r/hK7jY6/1
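
For readers without access to the link, here is a minimal Python sketch that reproduces the problem. It assumes the pattern is applied case-insensitively (the sample text contains upper-case letters) and that every match in the string is collected; the variable names are only illustrative.

```python
import re

text = "2 HCl + 12 Na + 3 (Na₃Cl₂)₂₄ → 2 NaCl + H₂"

# The pattern from the question. re.IGNORECASE is an assumption here,
# needed because [a-z] alone would not match "H", "C" or "N".
pattern = re.compile(r"(([0-9]* ??\(*([a-z]+[₀-₉]*)+\)*[₀-₉]*))", re.IGNORECASE)

# group(0) is the whole match; repr() makes a leading space visible.
matches = [m.group(0) for m in pattern.finditer(text)]
for match in matches:
    print(repr(match))
# The last line printed is ' H₂', with the unwanted leading space.
```

The digits and the space are optional independently of each other, so at the space before H₂ the engine can match zero digits, then the space, then the molecule.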

Asked by tobloef on Feb 06 '16 at 15:02


2 Answers

Update

If your strings are just valid chemical formulae, why bother distinguishing subscripts, digits, and letters? They are all non-whitespace symbols. Since each molecule must start with a letter or a (, put those in the character class [a-z(], and then append \S* (zero or more non-whitespace characters):

/(?:\d+ )?[a-z(]\S*/gi

See the regex demo. The (?:...)? construct is an optional non-capturing group, i.e. a group that is used only for grouping and does not capture (store the submatch in a memory buffer).
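
As a rough check, the same pattern can be tried in Python. This is only a sketch: the g flag is approximated with findall and the i flag with re.IGNORECASE, and the variable names are illustrative.

```python
import re

text = "2 HCl + 12 Na + 3 (Na₃Cl₂)₂₄ → 2 NaCl + H₂"

# (?:\d+ )? is the optional coefficient plus space; [a-z(] is the
# obligatory first character; \S* takes the rest up to the next whitespace.
pattern = re.compile(r"(?:\d+ )?[a-z(]\S*", re.IGNORECASE)

# The pattern has no capturing groups, so findall returns the full matches.
matches = pattern.findall(text)
print(matches)
# ['2 HCl', '12 Na', '3 (Na₃Cl₂)₂₄', '2 NaCl', 'H₂']
```

The + and → symbols are skipped because neither can start a match, and the space before H₂ is skipped because a space is only allowed directly after digits.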

Original answer with explanation of the root cause

Your pattern makes the digits and the space at the beginning optional independently of each other. Instead, make both obligatory, but place them inside a single optional group:

(?:[0-9]+ )?\(*([a-z]+[₀-₉]*)+\)*[₀-₉]*

See regex demo

Your [0-9]* ?? is turned into (?:[0-9]+ )?. Note that you do not need the lazy version of the ? quantifier here; it works the same way as the greedy one. I also removed the 2 unnecessary outer groups (...).

Since the (?:[0-9]+ )? group is optional, the space is matched only if there are digits in front of it. If there are no digits, the next thing that can be matched is zero or more ( characters. Then an [a-z] letter must be present (if there is no (, the letter will be the first character in the match).

Let me break it down:

  • (?:[0-9]+ )? - optional one or more digits followed by a space
  • \(* - zero or more ( (maybe you meant ?)
  • ([a-z]+[₀-₉]*)+ - one or more sequences of one or more letters followed by zero or more subscript digits
  • \)* - zero or more ) (maybe you meant ?)
  • [₀-₉]* - zero or more subscript digits
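
The breakdown above can also be written as a commented pattern. The sketch below uses Python's re.VERBOSE mode, which is an assumption about the tooling rather than part of the original answer; in that mode a literal space has to be written as [ ].

```python
import re

text = "2 HCl + 12 Na + 3 (Na₃Cl₂)₂₄ → 2 NaCl + H₂"

pattern = re.compile(
    r"""
    (?:[0-9]+[ ])?     # optional: one or more digits followed by a space
    \(*                # zero or more opening parentheses
    ([a-z]+[₀-₉]*)+    # one or more runs of letters, each with optional subscript digits
    \)*                # zero or more closing parentheses
    [₀-₉]*             # zero or more subscript digits
    """,
    re.IGNORECASE | re.VERBOSE,
)

# The pattern contains a capturing group, so take group(0) for the full match.
matches = [m.group(0) for m in pattern.finditer(text)]
print(matches)
# ['2 HCl', '12 Na', '3 (Na₃Cl₂)₂₄', '2 NaCl', 'H₂']
```

The last match no longer starts with a space, because a space can only be consumed together with the digits in front of it.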

If you also want to make sure you do not match unbalanced parentheses, as in (Ca or H), split the \(*...\)* part into two alternatives like this:

(?:[0-9]+ )?(?:(?:[a-z]+[₀-₉]*)+|\((?:[a-z]+[₀-₉]*)+\))[₀-₉]*

See another demo
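
To see the difference between the two patterns, here is a small Python comparison. It is a sketch: the names loose, strict and full_matches are made up for this example, and case-insensitive matching is assumed.

```python
import re

# Allows any number of parentheses on either side, balanced or not.
loose = re.compile(r"(?:[0-9]+ )?\(*([a-z]+[₀-₉]*)+\)*[₀-₉]*", re.IGNORECASE)

# Accepts either no parentheses at all or exactly one enclosing pair.
strict = re.compile(
    r"(?:[0-9]+ )?(?:(?:[a-z]+[₀-₉]*)+|\((?:[a-z]+[₀-₉]*)+\))[₀-₉]*",
    re.IGNORECASE,
)

def full_matches(pattern, text):
    """Return the full text of every match, ignoring capture groups."""
    return [m.group(0) for m in pattern.finditer(text)]

for sample in ["(Ca", "H)", "3 (Na₃Cl₂)₂₄"]:
    print(repr(sample), full_matches(loose, sample), full_matches(strict, sample))
# '(Ca' ['(Ca'] ['Ca']
# 'H)' ['H)'] ['H']
# '3 (Na₃Cl₂)₂₄' ['3 (Na₃Cl₂)₂₄'] ['3 (Na₃Cl₂)₂₄']
```

With the stricter pattern the stray parenthesis is left out of the match; the letters themselves are still matched.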

Answered by Wiktor Stribiżew on Oct 11 '22 at 02:10


While Wiktor's answer is very informative, I think I might have found an easier way of doing this.

([0-9]+ )*[a-z\(₀-₉\)]+

This will match all the parts of the equation as far as I can tell.
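
A quick Python check of this pattern against the sample text, again assuming case-insensitive matching (the variable names are illustrative):

```python
import re

text = "2 HCl + 12 Na + 3 (Na₃Cl₂)₂₄ → 2 NaCl + H₂"

pattern = re.compile(r"([0-9]+ )*[a-z\(₀-₉\)]+", re.IGNORECASE)

# The pattern contains a capturing group, so take group(0) for the full match.
matches = [m.group(0) for m in pattern.finditer(text)]
print(matches)
# ['2 HCl', '12 Na', '3 (Na₃Cl₂)₂₄', '2 NaCl', 'H₂']

# The single character class does not check order or balance, so input
# that is not a formula is accepted as well.
odd = [m.group(0) for m in pattern.finditer("))₂(")]
print(odd)
# ['))₂(']
```

So it works on the sample, but it relies on the input already being a well-formed equation.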

Demo

Update

Please see Wiktor's updated answer; it's better than this.

Answered by tobloef on Oct 11 '22 at 02:10