 

Parsing a chemical formula


I'm trying to write a method for an app that takes a chemical formula like "CH3COOH" and returns some sort of collection containing its element symbols, one entry per atom.

CH3COOH would return [C,H,H,H,C,O,O,H]

I already have something that is kinda working, but it's very complicated and uses a lot of code with a lot of nested if-else structures and loops.

Is there a way I can do this by using some kind of regular expression with String.split or maybe in some other brilliant simple code?

Christian Kjær asked Jun 04 '10 13:06




1 Answer

I have written several series of articles on how to parse molecular formulas, including more complex formulas like C6H2(NO2)3CH3.

The most recent is my presentation "PLY and PyParsing" at PyCon 2010, where I compare those two Python parsing systems using a molecular formula evaluator as my sample problem. There's even a video of my presentation.

The presentation was based on a three-part series of articles I did developing a molecular formula parser using ANTLR. In part 3 I compare the ANTLR solution to a hand-written regular expression parser and solutions in PLY and PyParsing.

The regexp and PLY solutions were first developed in a two-part series on two ways of writing parsers in Python.

The regexp solution and the base ANTLR/PLY/PyParsing solutions use a regular expression like [A-Z][a-z]?\d* to match terms in the formula. This is what @David M suggested.

Here it is worked out in Python:

import re

# element_name is: capital letter followed by optional lower-case
# count is: empty string (so the count is 1), or a set of digits
element_pat = re.compile(r"([A-Z][a-z]?)(\d*)")

all_elements = []
for (element_name, count) in element_pat.findall("CH3COOH"):
    if count == "":
        count = 1
    else:
        count = int(count)
    all_elements.extend([element_name] * count)

print(all_elements)

When I run this (it's hard-coded to use acetic acid, CH3COOH) I get

['C', 'H', 'H', 'H', 'C', 'O', 'O', 'H'] 

Do note that this short bit of code assumes the molecular formula is correct. If you give it something like "##$%^O2#$$#" then it will ignore the fields it doesn't know about and give ['O', 'O']. If you don't want that then you'll have to make it a bit more robust.
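As a rough sketch of what "more robust" could look like: instead of letting findall skip over junk, match one term at a time from the current position and raise an error as soon as something doesn't match. The names parse_formula_strict and term_pat are made up for this illustration. It only checks the shape of each term, not whether the symbol is a real element, so "Xx2" would still pass.

```python
import re

# One term: element symbol plus optional count (same pattern as above).
term_pat = re.compile(r"([A-Z][a-z]?)(\d*)")

def parse_formula_strict(formula):
    """Hypothetical helper: expand a flat formula, rejecting unknown text."""
    all_elements = []
    pos = 0
    while pos < len(formula):
        # match() anchors at pos, so unrecognized characters are not skipped
        m = term_pat.match(formula, pos)
        if m is None:
            raise ValueError(
                "unexpected character %r at position %d" % (formula[pos], pos))
        element_name, count = m.groups()
        all_elements.extend([element_name] * (int(count) if count else 1))
        pos = m.end()
    return all_elements
```

With this version, parse_formula_strict("CH3COOH") gives the same list as before, while "##$%^O2#$$#" raises a ValueError at the first "#" instead of quietly returning ['O', 'O'].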

If you want to support more complicated formulas, like C6H2(NO2)3CH3, then you'll need to know a bit about tree data structures, specifically (as @Roman points out), abstract syntax trees (most often called ASTs). That's too complicated to get into here, so see my talk and essays for more details.
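To give a small taste of the problem without building a tree at all, here is a sketch that handles parentheses with a plain stack of lists. This is not the AST approach from my talk and essays, just a shortcut that works when all you want is the flat list of symbols. The names expand_formula and token_pat are invented for this example, and it assumes counts only follow an element or a closing parenthesis.

```python
import re

# A token is: an element with optional count, "(", or ")" with optional count.
token_pat = re.compile(r"([A-Z][a-z]?)(\d*)|(\()|\)(\d*)")

def expand_formula(formula):
    """Hypothetical helper: expand a formula that may contain parentheses."""
    stack = [[]]  # stack[-1] is the group currently being filled
    pos = 0
    while pos < len(formula):
        m = token_pat.match(formula, pos)
        if m is None:
            raise ValueError(
                "unexpected character %r at position %d" % (formula[pos], pos))
        element, count, open_paren, group_count = m.groups()
        if element is not None:
            stack[-1].extend([element] * (int(count) if count else 1))
        elif open_paren is not None:
            stack.append([])  # start a new, nested group
        else:
            # closing parenthesis: repeat the finished group into its parent
            if len(stack) == 1:
                raise ValueError("unmatched ')' at position %d" % pos)
            group = stack.pop()
            stack[-1].extend(group * (int(group_count) if group_count else 1))
        pos = m.end()
    if len(stack) != 1:
        raise ValueError("unmatched '('")
    return stack[0]
```

For C6H2(NO2)3CH3 this yields 7 carbons, 5 hydrogens, 3 nitrogens and 6 oxygens. Once you need more than a flat list (for example, keeping the grouping around for later analysis), the tree-based approach is the better fit.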

Andrew Dalke answered Sep 28 '22 05:09
