
RegEx for parsing chemical formulas

I need a way to separate a chemical formula into its components. The result should look like this:

   Ag3PO4 -> [Ag3, P, O4]
      H2O -> [H2, O]
   CH3OOH -> [C, H3, O, O, H]
Ca3(PO4)2 -> [Ca3, (PO4)2]

I don't know regex syntax, but I know I need something like this:

[An optional parenthesis][A capital letter][0 or more lowercase letters][0 or more numbers][An optional parenthesis][0 or more numbers]

This worked:

// Matches an element symbol with an optional count, or a parenthesized group with an optional multiplier.
NSRegularExpression *regex = [NSRegularExpression
                              regularExpressionWithPattern:@"[A-Z][a-z]*\\d*|\\([^)]+\\)\\d*"
                              options:0
                              error:nil];
NSArray *tests = [[NSArray alloc] initWithObjects:@"Ca3(PO4)2", @"HCl", @"CaCO3", @"ZnCl2", @"C7H6O2", @"BaSO4", nil];
for (NSString *testString in tests)
{
    NSLog(@"Testing: %@", testString);
    // Collect every match over the full length of the string.
    NSArray *myArray = [regex matchesInString:testString options:0 range:NSMakeRange(0, [testString length])];
    NSMutableArray *matches = [NSMutableArray arrayWithCapacity:[myArray count]];

    for (NSTextCheckingResult *match in myArray) {
        // Range 0 is the whole match, i.e. one component of the formula.
        NSRange matchRange = [match rangeAtIndex:0];
        [matches addObject:[testString substringWithRange:matchRange]];
        NSLog(@"%@", [matches lastObject]);
    }
}

michaelsnowden asked May 12 '14 06:05


4 Answers

(PO4)2 really stands apart from the other components.

Let's start simple and match the items without parentheses:

[A-Z][a-z]?\d*

Using the regex above, we can successfully parse Ag3PO4, H2O and CH3OOH.

Then we need to add an expression for a parenthesized group. A group by itself can be matched using:

\(.*?\)\d+

So we combine the two with an alternation (or):

[A-Z][a-z]?\d*|\(.*?\)\d+

This works for the given cases, but maybe you have some more samples.
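As a quick check, the combined pattern can be run against the four formulas from the question. This is a minimal sketch in Python rather than Objective-C, chosen only because it is short to run; `FLAT_PATTERN` and `split_formula` are names made up for this example.

```python
import re

# The combined pattern: an element with an optional count,
# or a non-nested parenthesized group followed by a count.
FLAT_PATTERN = re.compile(r"[A-Z][a-z]?\d*|\(.*?\)\d+")

def split_formula(formula):
    """Return the top-level components of a formula without nested groups."""
    # findall returns whole matches because the pattern has no capture groups.
    return FLAT_PATTERN.findall(formula)

for formula in ["Ag3PO4", "H2O", "CH3OOH", "Ca3(PO4)2"]:
    print(formula, "->", split_formula(formula))
```

Each formula should come back split the way the question asks, for example Ca3(PO4)2 as Ca3 and (PO4)2.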

Note: It will have problems with nested parentheses, e.g. Co3(Fe(CN)6)2.

If you want to handle that case and your engine supports variable-length lookbehind, you can use the following regex:

[A-Z][a-z]?\d*|(?<!\([^)]*)\(.*\)\d+(?![^(]*\))

For Objective-C you can use the expression without lookarounds:

[A-Z][a-z]?\d*|\([^()]*(?:\(.*\))?[^()]*\)\d+

Or a regex with repetition, in case there is anything like A(B(CD)3E(FG)4)5, with multiple parenthesized blocks inside one (I don't know of any such formulas):

[A-Z][a-z]?\d*|\((?:[^()]*(?:\(.*\))?[^()]*)+\)\d+
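To see the difference on nested input, here is a small sketch that runs the simple pattern and the two lookaround-free patterns side by side. It uses Python's built-in `re` module instead of NSRegularExpression purely for brevity, so behaviour under another engine is an assumption to verify; the constant names are invented for this example.

```python
import re

# Simple pattern: the lazy group stops at the first ")" it finds.
FLAT = re.compile(r"[A-Z][a-z]?\d*|\(.*?\)\d+")
# Lookaround-free pattern for a group that contains a nested block.
NESTED = re.compile(r"[A-Z][a-z]?\d*|\([^()]*(?:\(.*\))?[^()]*\)\d+")
# Variant with repetition, for several nested blocks inside one group.
REPEATED = re.compile(r"[A-Z][a-z]?\d*|\((?:[^()]*(?:\(.*\))?[^()]*)+\)\d+")

flat_result = FLAT.findall("Co3(Fe(CN)6)2")
nested_result = NESTED.findall("Co3(Fe(CN)6)2")
repeated_result = REPEATED.findall("A(B(CD)3E(FG)4)5")

print(flat_result)      # the outer group is cut short
print(nested_result)    # the outer group is kept whole
print(repeated_result)
```

With the simple pattern the group comes back as (Fe(CN)6, losing the outer closing parenthesis and its multiplier, while the nested pattern returns (Fe(CN)6)2 in one piece.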

Ulugbek Umirov answered Oct 16 '22 12:10


When you encounter a parenthesis group, you don't want to parse what's inside, right?

If there are no nested parenthesized groups, you can simply use:

[A-Z][a-z]*\d*|\([^)]+\)\d*

\d is a shortcut for [0-9], and [^)] means any character except a closing parenthesis.
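For readers new to regex syntax, the same pattern can be written out piece by piece. The sketch below uses Python's `re.VERBOSE` flag, which allows whitespace and comments inside a pattern; Python is used here only as an illustration, and `COMPONENT` is a made-up name. The test strings are the ones from the question's code.

```python
import re

# The same pattern, spread over several lines so each part can be annotated.
COMPONENT = re.compile(r"""
    [A-Z]      # one capital letter starts an element symbol
    [a-z]*     # zero or more lowercase letters
    \d*        # optional count
  |            # OR
    \(         # a literal opening parenthesis
    [^)]+      # one or more characters that are not ")"
    \)         # a literal closing parenthesis
    \d*        # optional multiplier
""", re.VERBOSE)

results = {}
for formula in ["Ca3(PO4)2", "HCl", "CaCO3", "ZnCl2", "C7H6O2", "BaSO4"]:
    results[formula] = COMPONENT.findall(formula)
    print(formula, "->", results[formula])
```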

Robin answered Oct 16 '22 14:10


This should just about work:

/(\(?)([A-Z])([a-z]*)([0-9]*)(\))?([0-9]*)/g

Play around with it here: http://refiddle.com/
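A quick way to see what "just about" means is to run the pattern on the question's examples. The sketch below is a Python translation (the `/…/g` literal becomes `re.finditer`), so it is an approximation of the JavaScript-style original; `whole_matches` is a hypothetical helper name.

```python
import re

# Same pattern as above, without the /.../g delimiters.
PATTERN = re.compile(r"(\(?)([A-Z])([a-z]*)([0-9]*)(\))?([0-9]*)")

def whole_matches(formula):
    """Return the full text of every match, ignoring the capture groups."""
    return [m.group(0) for m in PATTERN.finditer(formula)]

print(whole_matches("CH3OOH"))
print(whole_matches("Ca3(PO4)2"))
```

Formulas without parentheses split as requested, but because each match contains exactly one capital letter, Ca3(PO4)2 comes back as Ca3, (P and O4)2 rather than keeping (PO4)2 together.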

Christof answered Oct 16 '22 14:10


This pattern should work, depending on your regex engine (it relies on the recursion construct (?R), which PCRE supports):
([A-Z][a-z]*\d*)|(\((?:[^()]+|(?R))*\)\d*)
Use it with the g and m options.
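The (?R) construct asks the engine to recurse into the whole pattern, which Python's built-in `re` module, for one, does not support. Where recursion is unavailable, one alternative is to drop the single-regex approach and count parenthesis depth by hand. The following is a minimal sketch of that swapped-in technique, not the pattern above; `split_top_level`, `ELEMENT` and `DIGITS` are names invented for this example.

```python
import re

# Hypothetical helpers: one element such as "Ca3" or "O", and a trailing count.
ELEMENT = re.compile(r"[A-Z][a-z]*\d*")
DIGITS = re.compile(r"\d*")

def split_top_level(formula):
    """Split a formula into top-level elements and parenthesized groups."""
    parts = []
    i = 0
    while i < len(formula):
        if formula[i] == "(":
            depth = 0
            j = i
            # Walk forward until the parenthesis opened at i is closed.
            while j < len(formula):
                if formula[j] == "(":
                    depth += 1
                elif formula[j] == ")":
                    depth -= 1
                    if depth == 0:
                        break
                j += 1
            if depth != 0:
                raise ValueError("unbalanced parentheses in %r" % formula)
            # Include the multiplier that follows the closing parenthesis.
            end = DIGITS.match(formula, j + 1).end()
            parts.append(formula[i:end])
            i = end
        else:
            m = ELEMENT.match(formula, i)
            if m is None:
                raise ValueError("unexpected character at index %d" % i)
            parts.append(m.group())
            i = m.end()
    return parts

print(split_top_level("Co3(Fe(CN)6)2"))
```

Because the scanner tracks depth explicitly, it handles any level of nesting and reports unbalanced input instead of silently returning a partial match.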

alpha bravo answered Oct 16 '22 13:10