A strict regular expression for matching chemical formulae

Tags: regex, perl, pcre

While processing a large textual chemical database with Perl, I ran into the problem of matching chemical formulae with a regex. I have seen these two previous topics, but the answers suggested there are too loose for my requirements.

Specifically, my (admittedly limited) research led me to this posting, which gives a regex for the currently accepted chemical symbols. I'll copy it here for reference:

[BCFHIKNOPSUVWY]|[ISZ][nr]|[ACELP][ru]|A[cglmst]|B[aehikr]|C[adeflmnos]|D[bsy]|Es|F[elmr]|G[ade]|H[efgos]|Kr|L[aiv]|M[cdgnot]|N[abdehiop]|O[gs]|P[abdmot]|R[abe-hnu]|S[bcegim]|T[abcehilms]|Xe|Yb

(Thus e.g. C, Cm, and Cn will pass, but not Cg or Cx.)
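
As a quick sanity check, here is a minimal Perl sketch that anchors the element alternation so a string must be exactly one symbol. The helper name is_element_symbol is made up for illustration, and the C class is written as C[adeflmnos] so that Cm and Cn pass as stated.

```perl
use strict;
use warnings;

# Element alternation from the question. qr// wraps it in its own group,
# so the | alternatives stay contained when it is interpolated later.
my $element = qr/[BCFHIKNOPSUVWY]|[ISZ][nr]|[ACELP][ru]|A[cglmst]|B[aehikr]|C[adeflmnos]|D[bsy]|Es|F[elmr]|G[ade]|H[efgos]|Kr|L[aiv]|M[cdgnot]|N[abdehiop]|O[gs]|P[abdmot]|R[abe-hnu]|S[bcegim]|T[abcehilms]|Xe|Yb/;

# Hypothetical helper: true only if the whole string is one element symbol.
sub is_element_symbol {
    my ($s) = @_;
    return $s =~ /\A$element\z/ ? 1 : 0;
}

for my $s (qw(C Cm Cn Cg Cx)) {
    printf "%-2s %s\n", $s, is_element_symbol($s) ? 'ok' : 'rejected';
}
```

The anchors \A and \z matter here: without them, Cg would still "match" because the engine finds the C inside it.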

As with the previous questions, I also need to match numbers, complete sets of parentheses and complete sets of square brackets, so that both e.g. C2H6O and (CH3)2CFCOO(CH2)2Si(CH3)2Cl are matched.

So how do I combine the previous solutions with the grand regex for matching valid chemical elements to strictly match a chemical formula?

(If it's not too much trouble to add, a blow-by-blow account of how to humanly parse the regex would be appreciated greatly, though not strictly necessary.)

asked Sep 13 '17 14:09 by Krissy Budd



1 Answer

Brief

I decided to build one large regex that does what you want while still staying readable, by giving every piece a name inside a (?(DEFINE)...) block. The regex is meant to be used together with a loop that goes over the matches for bracket or parentheses groups.


Assumptions

I am assuming the following since the OP has not given a full list of positive and negative matches:

  • Nested parentheses aren't possible
  • Nested square brackets aren't possible
  • Square bracket groups that surround a single parentheses group are redundant and therefore incorrect
  • Square bracket groups must contain at least 2 groups, of which 1 such group must be a parentheses group

If any of these assumptions are incorrect, please let me know so that I can fix the regex accordingly.


Answer

View this regex in use here

Code

(?(DEFINE)
  (?# Periodic elements )
  (?<Hydrogen>H)
  (?<Helium>He)
  (?<Lithium>Li)
  (?<Beryllium>Be)
  (?<Boron>B)
  (?<Carbon>C)
  (?<Nitrogen>N)
  (?<Oxygen>O)
  (?<Fluorine>F)
  (?<Neon>Ne)
  (?<Sodium>Na)
  (?<Magnesium>Mg)
  (?<Aluminum>Al)
  (?<Silicon>Si)
  (?<Phosphorus>P)
  (?<Sulfur>S)
  (?<Chlorine>Cl)
  (?<Argon>Ar)
  (?<Potassium>K)
  (?<Calcium>Ca)
  (?<Scandium>Sc)
  (?<Titanium>Ti)
  (?<Vanadium>V)
  (?<Chromium>Cr)
  (?<Manganese>Mn)
  (?<Iron>Fe)
  (?<Cobalt>Co)
  (?<Nickel>Ni)
  (?<Copper>Cu)
  (?<Zinc>Zn)
  (?<Gallium>Ga)
  (?<Germanium>Ge)
  (?<Arsenic>As)
  (?<Selenium>Se)
  (?<Bromine>Br)
  (?<Krypton>Kr)
  (?<Rubidium>Rb)
  (?<Strontium>Sr)
  (?<Yttrium>Y)
  (?<Zirconium>Zr)
  (?<Niobium>Nb)
  (?<Molybdenum>Mo)
  (?<Technetium>Tc)
  (?<Ruthenium>Ru)
  (?<Rhodium>Rh)
  (?<Palladium>Pd)
  (?<Silver>Ag)
  (?<Cadmium>Cd)
  (?<Indium>In)
  (?<Tin>Sn)
  (?<Antimony>Sb)
  (?<Tellurium>Te)
  (?<Iodine>I)
  (?<Xenon>Xe)
  (?<Cesium>Cs)
  (?<Barium>Ba)
  (?<Lanthanum>La)
  (?<Cerium>Ce)
  (?<Praseodymium>Pr)
  (?<Neodymium>Nd)
  (?<Promethium>Pm)
  (?<Samarium>Sm)
  (?<Europium>Eu)
  (?<Gadolinium>Gd)
  (?<Terbium>Tb)
  (?<Dysprosium>Dy)
  (?<Holmium>Ho)
  (?<Erbium>Er)
  (?<Thulium>Tm)
  (?<Ytterbium>Yb)
  (?<Lutetium>Lu)
  (?<Hafnium>Hf)
  (?<Tantalum>Ta)
  (?<Tungsten>W)
  (?<Rhenium>Re)
  (?<Osmium>Os)
  (?<Iridium>Ir)
  (?<Platinum>Pt)
  (?<Gold>Au)
  (?<Mercury>Hg)
  (?<Thallium>Tl)
  (?<Lead>Pb)
  (?<Bismuth>Bi)
  (?<Polonium>Po)
  (?<Astatine>At)
  (?<Radon>Rn)
  (?<Francium>Fr)
  (?<Radium>Ra)
  (?<Actinium>Ac)
  (?<Thorium>Th)
  (?<Protactinium>Pa)
  (?<Uranium>U)
  (?<Neptunium>Np)
  (?<Plutonium>Pu)
  (?<Americium>Am)
  (?<Curium>Cm)
  (?<Berkelium>Bk)
  (?<Californium>Cf)
  (?<Einsteinium>Es)
  (?<Fermium>Fm)
  (?<Mendelevium>Md)
  (?<Nobelium>No)
  (?<Lawrencium>Lr)
  (?<Rutherfordium>Rf)
  (?<Dubnium>Db)
  (?<Seaborgium>Sg)
  (?<Bohrium>Bh)
  (?<Hassium>Hs)
  (?<Meitnerium>Mt)
  (?<Darmstadtium>Ds)
  (?<Roentgenium>Rg)
  (?<Copernicium>Cn)
  (?<Nihonium>Nh)
  (?<Flerovium>Fl)
  (?<Moscovium>Mc)
  (?<Livermorium>Lv)
  (?<Tennessine>Ts)
  (?<Oganesson>Og)
  (?# Regex )
  (?<Element>(?&Actinium)|(?&Silver)|(?&Aluminum)|(?&Americium)|(?&Argon)|(?&Arsenic)|(?&Astatine)|(?&Gold)|(?&Barium)|(?&Beryllium)|(?&Bohrium)|(?&Bismuth)|(?&Berkelium)|(?&Bromine)|(?&Boron)|(?&Calcium)|(?&Cadmium)|(?&Cerium)|(?&Californium)|(?&Chlorine)|(?&Curium)|(?&Copernicium)|(?&Cobalt)|(?&Chromium)|(?&Cesium)|(?&Copper)|(?&Carbon)|(?&Dubnium)|(?&Darmstadtium)|(?&Dysprosium)|(?&Erbium)|(?&Einsteinium)|(?&Europium)|(?&Iron)|(?&Flerovium)|(?&Fermium)|(?&Francium)|(?&Fluorine)|(?&Gallium)|(?&Gadolinium)|(?&Germanium)|(?&Helium)|(?&Hafnium)|(?&Mercury)|(?&Holmium)|(?&Hassium)|(?&Hydrogen)|(?&Indium)|(?&Iridium)|(?&Iodine)|(?&Krypton)|(?&Potassium)|(?&Lanthanum)|(?&Lithium)|(?&Lawrencium)|(?&Lutetium)|(?&Livermorium)|(?&Moscovium)|(?&Mendelevium)|(?&Magnesium)|(?&Manganese)|(?&Molybdenum)|(?&Meitnerium)|(?&Sodium)|(?&Niobium)|(?&Neodymium)|(?&Neon)|(?&Nihonium)|(?&Nickel)|(?&Nobelium)|(?&Neptunium)|(?&Nitrogen)|(?&Oganesson)|(?&Osmium)|(?&Oxygen)|(?&Protactinium)|(?&Lead)|(?&Palladium)|(?&Promethium)|(?&Polonium)|(?&Praseodymium)|(?&Platinum)|(?&Plutonium)|(?&Phosphorus)|(?&Radium)|(?&Rubidium)|(?&Rhenium)|(?&Rutherfordium)|(?&Roentgenium)|(?&Rhodium)|(?&Radon)|(?&Ruthenium)|(?&Antimony)|(?&Scandium)|(?&Selenium)|(?&Seaborgium)|(?&Silicon)|(?&Samarium)|(?&Tin)|(?&Strontium)|(?&Sulfur)|(?&Tantalum)|(?&Terbium)|(?&Technetium)|(?&Tellurium)|(?&Thorium)|(?&Titanium)|(?&Thallium)|(?&Thulium)|(?&Tennessine)|(?&Uranium)|(?&Vanadium)|(?&Tungsten)|(?&Xenon)|(?&Ytterbium)|(?&Yttrium)|(?&Zirconium)|(?&Zinc))
  (?<Num>(?:[1-9]\d*)?)
  (?<ElementGroup>(?:(?&Element)(?&Num))+)
  (?<ElementParenthesesGroup>\((?&ElementGroup)+\)(?&Num))
  (?<ElementSquareBracketGroup>\[(?:(?:(?&ElementParenthesesGroup)(?:(?&ElementGroup)|(?&ElementParenthesesGroup))+)|(?:(?:(?&ElementGroup)|(?&ElementParenthesesGroup))+(?&ElementParenthesesGroup)))\](?&Num))
)
^((?<Brackets>(?&ElementSquareBracketGroup))|(?<Parentheses>(?&ElementParenthesesGroup))|(?<Group>(?&ElementGroup)))+$

Explanation

  1. The first part of the (?(DEFINE)...) section defines one named group per periodic element (ordered by atomic number for easy lookup).
  2. The Element group is a simple alternation (|) of the elements defined in step 1. The symbols are ordered alphabetically by first character and then by descending symbol length, so that a two-letter symbol is tried before its one-letter prefix (for example, Calcium Ca before Carbon C).
  3. ElementGroup specifies a run of elements: one or more Element, each optionally followed by a count (the group Num, a positive integer with no leading zero)
    • Valid Examples
      • C - Element
      • CH - Element followed by another Element
      • CH3 - Element followed by another Element and a Num
      • O2 - Element followed by a Num
    • Invalid Examples
      • N0 - a count of 0 is not allowed
      • N01 - Num must either start with 1-9 (no leading zero) or be absent
      • A - Element does not exist
      • c - Element does not exist - case sensitive regex
  4. ElementParenthesesGroup specifies one or more groupings of ElementGroup between parentheses ( ) but containing at least one ElementGroup
    • Valid Examples
      • (CH) - ElementGroup surrounded by parentheses
      • (CH3) - ElementGroup surrounded by parentheses
      • (CH3NO4) - multiple ElementGroup surrounded by parentheses
      • (CH3NO4)2 - multiple ElementGroup surrounded by parentheses followed by a Num
    • Invalid Examples
      • (CH[NO4]) - Only ElementGroup is valid inside ElementParenthesesGroup
  5. ElementSquareBracketGroup specifies a grouping of ElementParenthesesGroup or ElementGroup between square brackets [ ] but containing at least one ElementParenthesesGroup and one other group (ElementParenthesesGroup or ElementGroup)
    • Valid Examples
      • [CH3(NO4)] - Contains at least one ElementParenthesesGroup and one other ElementParenthesesGroup or ElementGroup
      • [(NO4)CH]2 - Contains at least one ElementParenthesesGroup and one other ElementParenthesesGroup or ElementGroup followed by Num
      • [(NO4)(CH3)] - Contains at least one ElementParenthesesGroup and one other ElementParenthesesGroup or ElementGroup
    • Invalid Examples
      • [(NO4)] - Does not contain second group, brackets [ ] are redundant
      • [NO4] - Does not contain ElementParenthesesGroup
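
The grammar described in the steps above can also be sketched in plain Perl by composing qr// fragments instead of (?(DEFINE)...) groups. This is only an illustration under a few assumptions: it uses the compact element alternation from the question rather than one named group per element, the variable names and the is_formula helper are invented here, and the parentheses group contains a single $group (which already repeats) rather than a repeated one.

```perl
use strict;
use warnings;

# Compact element list from the question (assumed equivalent to the named groups).
my $element = qr/[BCFHIKNOPSUVWY]|[ISZ][nr]|[ACELP][ru]|A[cglmst]|B[aehikr]|C[adeflmnos]|D[bsy]|Es|F[elmr]|G[ade]|H[efgos]|Kr|L[aiv]|M[cdgnot]|N[abdehiop]|O[gs]|P[abdmot]|R[abe-hnu]|S[bcegim]|T[abcehilms]|Xe|Yb/;

my $num     = qr/(?:[1-9]\d*)?/;           # optional count, no leading zero
my $group   = qr/(?:$element$num)+/;       # ElementGroup
my $paren   = qr/\($group\)$num/;          # ElementParenthesesGroup
my $item    = qr/(?:$group|$paren)/;       # anything allowed inside [ ]
# ElementSquareBracketGroup: starts or ends with a parentheses group
# and contains at least one other item, as in the answer's pattern.
my $bracket = qr/\[(?:$paren$item+|$item+$paren)\]$num/;
my $formula = qr/\A(?:$bracket|$paren|$group)+\z/;

# Hypothetical helper: true if the whole string is a valid formula.
sub is_formula {
    my ($s) = @_;
    return $s =~ $formula ? 1 : 0;
}

my @valid   = qw{C CH CH3 O2 (CH3) (CH3NO4)2 [CH3(NO4)] [(NO4)CH]2 C2H6O};
my @invalid = qw{N0 N01 A c (CH[NO4]) [(NO4)] [NO4]};

printf "%-12s %s\n", $_, is_formula($_) ? 'valid' : 'invalid' for @valid, @invalid;
```

Because the pattern is anchored and the engine backtracks into the element alternation, the order of the element alternatives does not change which strings are accepted here.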

Additional Information

I realize this is a very long answer, but the OP is asking a very specific question and wants to ensure specific criteria are met.

Ensure the following flags are set:

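  • g - global: find every match instead of stopping at the first one
  • x - extended: whitespace inside the pattern (including the line breaks used for layout) is ignored
  • m - multi-line: needed if the data spans multiple lines (separated by newline characters), so that ^ and $ match at each line boundary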
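
In Perl these flags go after the closing delimiter of the pattern. The following is a minimal sketch, assuming a reduced grammar (ElementGroup only, with the compact element list and an extra Formula capture that is not part of the answer's regex) so that the effect of x, m and g is easy to see:

```perl
use strict;
use warnings;

# /x lets the pattern be laid out over several lines;
# /m makes ^ and $ match at every line boundary in the data.
my $re = qr{
    (?(DEFINE)
        (?<Element>
            [ISZ][nr] | [ACELP][ru] | A[cglmst] | B[aehikr] | C[adeflmnos]
          | D[bsy] | Es | F[elmr] | G[ade] | H[efgos] | Kr | L[aiv]
          | M[cdgnot] | N[abdehiop] | O[gs] | P[abdmot] | R[abe-hnu]
          | S[bcegim] | T[abcehilms] | Xe | Yb | [BCFHIKNOPSUVWY]
        )
        (?<Num>          (?: [1-9] \d* )?           )
        (?<ElementGroup> (?: (?&Element) (?&Num) )+ )
    )
    ^ (?<Formula> (?&ElementGroup) ) $
}xm;

my $data = "C2H6O\nN01\nNaCl\nXyz\n";

# /g in a while loop walks over every matching line in turn.
my @hits;
while ($data =~ /$re/g) {
    push @hits, $+{Formula};
}
print join(', ', @hits), "\n";
```

With this sample data the loop collects C2H6O and NaCl; N01 and Xyz are skipped. The groups inside (?(DEFINE)...) are still numbered capture groups, which is why the sketch reads the result from the named Formula group instead of relying on list-context captures.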

Note: When a capture group is repeated, the regex keeps only the last substring that group matched; each new match of, say, Parentheses overwrites the previous one. This is the default behaviour of the regex engine and there is currently no way to override it, so the named captures may give you undesirable results. You can see this with the last example in the linked regex as well as with your example of (CH3)2CFCOO(CH2)2Si(CH3)2Cl, since there are multiple groups of each type.
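
One way to work around this in Perl is the loop mentioned in the Brief: validate the whole string once, then walk it token by token with \G and /g. The sketch below is only an illustration; for brevity it uses deliberately loose token patterns ($paren and $group here are assumptions, not the strict element list), and the variable names are invented.

```perl
use strict;
use warnings;

my $formula = '(CH3)2CFCOO(CH2)2Si(CH3)2Cl';

# Loose stand-ins for the strict groups, to keep the example short.
my $paren = qr/\([A-Za-z0-9]+\)\d*/;
my $group = qr/[A-Za-z0-9]+/;

# One anchored match: each named group keeps only its final iteration.
my ($last_paren, $last_group);
if ($formula =~ /\A(?:(?<Parentheses>$paren)|(?<Group>$group))+\z/) {
    ($last_paren, $last_group) = ($+{Parentheses}, $+{Group});
    print "last Parentheses: $last_paren, last Group: $last_group\n";
}

# Token loop: \G continues where the previous match ended, and /c keeps
# pos() on failure so we can check that the whole string was consumed.
my @tokens;
pos($formula) = 0;
while ($formula =~ /\G(?:(?<Parentheses>$paren)|(?<Group>$group))/gc) {
    my ($type) = grep { defined $+{$_} } qw(Parentheses Group);
    push @tokens, [ $type, $+{$type} ];
}
my $consumed_all = pos($formula) == length $formula;

print "$_->[0]: $_->[1]\n" for @tokens;
```

The single anchored match reports only (CH3)2 and Cl, while the loop yields all six tokens in order, so each parentheses group and each element run can be processed separately.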

answered Sep 22 '22 23:09 by ctwheels