I need a way to separate a chemical formula into its components. The result should look like this:
Ag3PO4 -> [Ag3, P, O4]
H2O -> [H2, O]
CH3OOH -> [C, H3, O, O, H]
Ca3(PO4)2 -> [Ca3, (PO4)2]
I don't know regex syntax, but I know I need something like this:
[An optional parenthesis][A capital letter][0 or more lowercase letters][0 or more numbers][An optional parenthesis][0 or more numbers]
This worked:
// Match an element symbol with an optional count, or a parenthesised group with an optional count.
NSRegularExpression *regex = [NSRegularExpression
    regularExpressionWithPattern:@"[A-Z][a-z]*\\d*|\\([^)]+\\)\\d*"
    options:0
    error:nil];
NSArray *tests = [[NSArray alloc] initWithObjects:@"Ca3(PO4)2", @"HCl", @"CaCO3", @"ZnCl2", @"C7H6O2", @"BaSO4", nil];
for (NSString *testString in tests)
{
    NSLog(@"Testing: %@", testString);
    NSArray *myArray = [regex matchesInString:testString options:0 range:NSMakeRange(0, [testString length])];
    NSMutableArray *matches = [NSMutableArray arrayWithCapacity:[myArray count]];
    for (NSTextCheckingResult *match in myArray) {
        // Range 0 is the whole match, i.e. one component of the formula.
        NSRange matchRange = [match rangeAtIndex:0];
        [matches addObject:[testString substringWithRange:matchRange]];
        NSLog(@"%@", [matches lastObject]);
    }
}
(PO4)2 really stands apart from the rest.
Let's start simple and match the items without parentheses:
[A-Z][a-z]?\d*
Using the regex above we can successfully parse Ag3PO4, H2O, and CH3OOH.
Then we need to add an expression for the group. A group by itself can be matched using:
\(.*?\)\d+
So we add an "or" condition:
[A-Z][a-z]?\d*|\(.*?\)\d+
This works for the given cases, but maybe you have some more samples.
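To check the combined pattern against the formulas from the question, here is a minimal Python sketch. The helper name split_simple is made up for illustration; the pattern itself is the one above, unchanged.

```python
import re

# Pattern from the answer: an element symbol with an optional count,
# or a (non-nested) parenthesised group followed by a count.
SIMPLE = re.compile(r"[A-Z][a-z]?\d*|\(.*?\)\d+")

def split_simple(formula):
    """Hypothetical helper: return the matched pieces in order."""
    return SIMPLE.findall(formula)

for formula in ["Ag3PO4", "H2O", "CH3OOH", "Ca3(PO4)2"]:
    print(formula, "->", split_simple(formula))
```

Since the pattern has no capture groups, findall returns the full text of each match, which is exactly the list of components asked for.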
Note: it will have problems with nested parentheses, e.g. Co3(Fe(CN)6)2.
If you want to handle that case, you can use the following regex:
[A-Z][a-z]?\d*|(?<!\([^)]*)\(.*\)\d+(?![^(]*\))
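For Objective-C, where the ICU engine behind NSRegularExpression does not accept an unbounded lookbehind such as (?<!\([^)]*), you can use this expression without lookarounds: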
[A-Z][a-z]?\d*|\([^()]*(?:\(.*\))?[^()]*\)\d+
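A quick way to see the difference on the nested example is to run both patterns side by side. This is only a sketch in Python; the names FLAT, ONE_LEVEL, split_flat and split_one_level are invented for the comparison.

```python
import re

# The earlier pattern, which stops at the first ")" followed by digits.
FLAT = re.compile(r"[A-Z][a-z]?\d*|\(.*?\)\d+")
# The lookaround-free pattern, which allows one inner (...) block.
ONE_LEVEL = re.compile(r"[A-Z][a-z]?\d*|\([^()]*(?:\(.*\))?[^()]*\)\d+")

def split_flat(formula):
    """Hypothetical helper using the flat pattern."""
    return FLAT.findall(formula)

def split_one_level(formula):
    """Hypothetical helper using the one-level-nesting pattern."""
    return ONE_LEVEL.findall(formula)

nested = "Co3(Fe(CN)6)2"
print(split_flat(nested))       # ['Co3', '(Fe(CN)6']  (outer group is cut short)
print(split_one_level(nested))  # ['Co3', '(Fe(CN)6)2']
```

The flat pattern silently drops the trailing ")2", so it is worth comparing the joined matches against the input if you need to detect formulas the pattern cannot handle.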
Or a regex with repetitions. I don't know of any such formulas, but in case there is anything like A(B(CD)3E(FG)4)5, with multiple parenthesised blocks inside one:
[A-Z][a-z]?\d*|\((?:[^()]*(?:\(.*\))?[^()]*)+\)\d+
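As a sanity check, the repetition pattern can be tried on that made-up formula in Python. This is just a sketch; split_repeated is a hypothetical helper name.

```python
import re

# Pattern with a repeated inner part, copied from the answer.
# Caveat (an assumption worth testing on your data): the nested
# quantifiers may backtrack heavily on long input with an unclosed "(".
REPEATED = re.compile(r"[A-Z][a-z]?\d*|\((?:[^()]*(?:\(.*\))?[^()]*)+\)\d+")

def split_repeated(formula):
    """Hypothetical helper: return the top-level pieces in order."""
    return REPEATED.findall(formula)

print(split_repeated("A(B(CD)3E(FG)4)5"))
print(split_repeated("Co3(Fe(CN)6)2"))
```

In both cases the outer parenthesised group should come back as a single piece, with nothing inside it split out.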
When you encounter a parenthesised group, you don't want to parse what's inside, right?
If there are no nested groups, you can simply use:
[A-Z][a-z]*\d*|\([^)]+\)\d*
\d is a shortcut for [0-9], and [^)] means any character except a closing parenthesis.
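For anyone who wants to try this pattern outside Objective-C, here is a small Python sketch run over the test formulas from the question. The helper name split_components is invented here.

```python
import re

# Element with optional count, or a non-nested (...) group with optional count.
COMPONENTS = re.compile(r"[A-Z][a-z]*\d*|\([^)]+\)\d*")

def split_components(formula):
    """Hypothetical helper: list the components of a formula."""
    return COMPONENTS.findall(formula)

tests = ["Ca3(PO4)2", "HCl", "CaCO3", "ZnCl2", "C7H6O2", "BaSO4"]
for formula in tests:
    print(formula, "->", split_components(formula))
```

Because [^)]+ stops at the first closing parenthesis, a nested formula such as Co3(Fe(CN)6)2 would not be grouped correctly by this version.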
This should just about work:
/(\(?)([A-Z])([a-z]*)([0-9]*)(\))?([0-9]*)/g
Play around with it here: http://refiddle.com/
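To see what "just about" means in practice, here is a sketch that runs the same pattern in Python (without the JavaScript-style /.../g delimiters, since finditer already finds every match). The helper name split_with_groups is made up.

```python
import re

# Same pattern as above, written as a Python raw string.
GROUPED = re.compile(r"(\(?)([A-Z])([a-z]*)([0-9]*)(\))?([0-9]*)")

def split_with_groups(formula):
    """Hypothetical helper: full text of each match, in order."""
    return [m.group(0) for m in GROUPED.finditer(formula)]

print(split_with_groups("H2O"))        # ['H2', 'O']
print(split_with_groups("Ca3(PO4)2"))  # ['Ca3', '(P', 'O4)2']
```

Plain formulas split as requested, but a parenthesised group is matched element by element, so (PO4)2 arrives as "(P" and "O4)2" and would need to be stitched back together.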
This pattern should work, depending on your regex engine (the (?R) recursion needs PCRE or a compatible engine):
([A-Z][a-z]*\d*)|(\((?:[^()]+|(?R))*\)\d*)
with the gm options.
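If your engine has no recursion (Python's built-in re module, for instance, does not support (?R)), one alternative is a small hand-written scanner that counts parenthesis depth. This swaps the regex for a different technique; it is a sketch, and the function name split_formula is invented here.

```python
def split_formula(formula):
    """Split a formula into top-level pieces, keeping nested groups whole.

    Hypothetical helper: a depth counter replaces regex recursion.
    """
    tokens = []
    i, n = 0, len(formula)
    while i < n:
        ch = formula[i]
        if ch == "(":
            # Walk forward until the parenthesis opened at i is closed.
            depth = 0
            j = i
            while j < n:
                if formula[j] == "(":
                    depth += 1
                elif formula[j] == ")":
                    depth -= 1
                    if depth == 0:
                        break
                j += 1
            if j >= n:
                raise ValueError("unbalanced parenthesis in %r" % formula)
            j += 1  # step past the closing ")"
        elif ch.isupper():
            # Element symbol: one capital letter plus any lowercase letters.
            j = i + 1
            while j < n and formula[j].islower():
                j += 1
        else:
            raise ValueError("unexpected character %r at index %d" % (ch, i))
        # Optional count after the element or group.
        while j < n and formula[j].isdigit():
            j += 1
        tokens.append(formula[i:j])
        i = j
    return tokens

print(split_formula("Co3(Fe(CN)6)2"))  # ['Co3', '(Fe(CN)6)2']
print(split_formula("A(B(C(D)2)3)4"))  # ['A', '(B(C(D)2)3)4']
```

Unlike the regex versions, this raises an error on unbalanced or unexpected input instead of quietly skipping characters, which may or may not be what you want.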