
How do I parse a chemical formula using a regular expression?

I have a list of element symbols, patterns:

patterns=['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne', 'Na', 'Mg', 'Al',
       'Si', 'P', 'S', 'Cl', 'Ar', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr', 'Mn',
       'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr', 'Rb',
       'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Tc', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In',
       'Sn', 'Sb', 'Te', 'I', 'Xe', 'Cs', 'Ba', 'La', 'Ce', 'Pr', 'Nd', 'Pm',
       'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu', 'Hf', 'Ta',
       'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Po', 'At',
       'Rn']

and I have a big dataframe of strings, for example:

str0='Mg0.97Fe0.03B2'
str1='Tl0.5Hg0.5Ba2Ca2Cu3O8'

I am trying this:

keyss=list(filter(None,regex.split("[^a-zA-Z]*",somestring)))
values=list(filter(None,regex.split("[^0-9.0-9]*",somestring)))

Sometimes, this works:

str2='Ba1Fe1.832Co0.15Mn0.018As2'
keyss=list(filter(None,regex.split("[^a-zA-Z]*",str2)))
values=list(filter(None,regex.split("[^0-9.0-9]*",str2)))
['Ba', 'Fe', 'Co', 'Mn', 'As']
['1', '1.832', '0.15', '0.018', '2']

However, if I have a string like this:

str3='Hg0.75SrBa2Ca2Cu3O8'
keyss=list(filter(None,regex.split("[^a-zA-Z]*",str3)))
values=list(filter(None,regex.split("[^0-9.0-9]*",str3)))
['Hg', 'SrBa', 'Ca', 'Cu', 'O']!=['Hg', 'Sr','Ba', 'Ca', 'Cu', 'O']
['0.75', '2', '2', '3', '8']!=['0.75', '1','2', '2', '3', '8']

or this:

str4='NbSn3'
keyss=list(filter(None,regex.split("[^a-zA-Z]*",str4)))
values=list(filter(None,regex.split("[^0-9.0-9]*",str4)))
['NbSn']!=['Nb','Sn']
['3']!=['1','3']
str4='Pb1.4Sr4Y1.2Ca0.8Cu4.6O'
...

My code does not work correctly. How can I fix it?

Oleg asked Dec 07 '20 10:12


1 Answer

You started well with the patterns list and then dropped the idea. The list is probably not needed here (although you could use it in a pyparsing grammar), and there is a simpler approach that follows your second idea of splitting the string.
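
As an aside, the patterns list can also be put to work with a plain regular expression, without pyparsing. This is only a minimal sketch: it assumes a shortened subset of your list, and split_formula is a hypothetical helper name, not something from a library.

```python
import re

# Shortened subset of the question's `patterns` list, for illustration only.
patterns = ['H', 'He', 'B', 'C', 'O', 'Mg', 'Fe', 'Ca', 'Cu',
            'Sr', 'Ba', 'Hg', 'Nb', 'Sn']

# Longest symbols first, so that 'He' is tried before 'H'.
alternation = '|'.join(sorted(patterns, key=len, reverse=True))
# One known symbol followed by an optional integer or decimal count.
token = re.compile(r'(%s)(\d+(?:\.\d+)?)?' % alternation)


def split_formula(formula):
    """Return (symbols, counts); a missing count is reported as '1'."""
    symbols, counts = [], []
    for match in token.finditer(formula):
        symbols.append(match.group(1))
        counts.append(match.group(2) or '1')
    return symbols, counts


print(split_formula('Hg0.75SrBa2Ca2Cu3O8'))
print(split_formula('NbSn3'))
```

Be aware that finditer silently skips anything that is not a known symbol, so a typo in a formula would go unnoticed with this sketch.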

I suggest you do something like this:

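import re as regex  # the standard-library re module is enough for this pattern
str3='Hg0.75SrBa2Ca2Cu3O8'
# The capturing group makes split() keep the element symbols in the result.
splitted = regex.split("([A-Z][a-z]*)",str3)
# Symbols start with an uppercase letter, counts start with a digit.
keyss = [a for a in splitted if a and a[0].isupper()]
# An element without an explicit count (Sr here) contributes no value.
values = [a for a in splitted if a and a[0].isdigit()]
print(keyss, values)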

['Hg', 'Sr', 'Ba', 'Ca', 'Cu', 'O'] ['0.75', '2', '2', '3', '8']
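
The output above has no entry for Sr, while the question expects a count of 1 for elements written without a number. One possible way to get that is to match each symbol together with its optional count in one go. The following is a sketch under assumptions: parse_formula and ELEMENT are names made up for this example, counts are converted to float, and only flat formulas (no parentheses or charges) are considered.

```python
import re

# An uppercase letter plus an optional lowercase letter, then an optional count.
ELEMENT = re.compile(r'([A-Z][a-z]?)(\d+(?:\.\d+)?)?')


def parse_formula(formula):
    """Return a list of (symbol, count) pairs; a missing count becomes 1.0."""
    # findall() yields '' for an optional group that did not take part.
    return [(symbol, float(count) if count else 1.0)
            for symbol, count in ELEMENT.findall(formula)]


print(parse_formula('Hg0.75SrBa2Ca2Cu3O8'))
print(parse_formula('NbSn3'))
```

Keeping the symbol and its count in one pair avoids the two lists drifting out of step, which is what happens when keys and values are extracted separately.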

sophros answered Sep 21 '22 14:09