I am trying to parse a chemical formula (in the format, for example: Al2O3 or O3 or C or C11H22O12) in C# from a string. It works fine unless there is only one atom of a particular element (e.g. the oxygen atom in H2O). How can I fix that problem, and in addition, is there a better way to parse a chemical formula string than the way I am doing it?
ChemicalElement is a class representing a chemical element. It has properties AtomicNumber (int), Name (string), Symbol (string). ChemicalFormulaComponent is a class representing a chemical element and atom count (e.g. part of a formula). It has properties Element (ChemicalElement), AtomCount (int).
The rest should be clear enough to understand (I hope) but please let me know with a comment if I can clarify anything, before you answer.
Here is my current code:
/// <summary>
/// Parses a chemical formula from a string.
/// </summary>
/// <param name="chemicalFormula">The string to parse.</param>
/// <exception cref="FormatException">The chemical formula was in an invalid format.</exception>
public static Collection<ChemicalFormulaComponent> FormulaFromString(string chemicalFormula)
{
    Collection<ChemicalFormulaComponent> formula = new Collection<ChemicalFormulaComponent>();
    string nameBuffer = string.Empty;
    int countBuffer = 0;
    for (int i = 0; i < chemicalFormula.Length; i++)
    {
        char c = chemicalFormula[i];
        if (!char.IsLetterOrDigit(c) || !char.IsUpper(chemicalFormula, 0))
        {
            throw new FormatException("Input string was in an incorrect format.");
        }
        else if (char.IsUpper(c))
        {
            // Add the chemical element and its atom count
            if (countBuffer > 0)
            {
                formula.Add(new ChemicalFormulaComponent(ChemicalElement.ElementFromSymbol(nameBuffer), countBuffer));
                // Reset
                nameBuffer = string.Empty;
                countBuffer = 0;
            }
            nameBuffer += c;
        }
        else if (char.IsLower(c))
        {
            nameBuffer += c;
        }
        else if (char.IsDigit(c))
        {
            if (countBuffer == 0)
            {
                countBuffer = c - '0';
            }
            else
            {
                countBuffer = (countBuffer * 10) + (c - '0');
            }
        }
    }
    return formula;
}
I rewrote your parser using regular expressions. Regular expressions fit the bill perfectly for what you're doing. Hope this helps.
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.Text.RegularExpressions;

public static void Main(string[] args)
{
    var testCases = new List<string>
    {
        "C11H22O12",
        "Al2O3",
        "O3",
        "C",
        "H2O"
    };
    foreach (string testCase in testCases)
    {
        Console.WriteLine("Testing {0}", testCase);
        var formula = FormulaFromString(testCase);
        foreach (var element in formula)
        {
            // Uses the Symbol and AtomCount properties described in the question
            Console.WriteLine("{0} : {1}", element.Element.Symbol, element.AtomCount);
        }
        Console.WriteLine();
    }
    /* Produced the following output

    Testing C11H22O12
    C : 11
    H : 22
    O : 12

    Testing Al2O3
    Al : 2
    O : 3

    Testing O3
    O : 3

    Testing C
    C : 1

    Testing H2O
    H : 2
    O : 1
    */
}
private static Collection<ChemicalFormulaComponent> FormulaFromString(string chemicalFormula)
{
    Collection<ChemicalFormulaComponent> formula = new Collection<ChemicalFormulaComponent>();
    // One element: an uppercase letter, optional lowercase letters, optional count
    string elementRegex = "([A-Z][a-z]*)([0-9]*)";
    // The whole string must consist of one or more elements
    string validateRegex = "^(" + elementRegex + ")+$";
    if (!Regex.IsMatch(chemicalFormula, validateRegex))
        throw new FormatException("Input string was in an incorrect format.");
    foreach (Match match in Regex.Matches(chemicalFormula, elementRegex))
    {
        string name = match.Groups[1].Value;
        // A missing count means a single atom
        int count =
            match.Groups[2].Value != "" ?
            int.Parse(match.Groups[2].Value) :
            1;
        formula.Add(new ChemicalFormulaComponent(ChemicalElement.ElementFromSymbol(name), count));
    }
    return formula;
}
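As a variation on the same idea, validation and extraction can be done with a single anchored match by reading the captures of a repeated named group. This is only a sketch under some assumptions: the class name `RegexFormulaSketch` and the group name `part` are made up for illustration, it returns plain (symbol, count) tuples so that it compiles without the question's `ChemicalElement` and `ChemicalFormulaComponent` classes, and it anchors with `\z` instead of `$` so that a trailing newline is rejected.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexFormulaSketch
{
    // Anchored pattern: the whole input must be a run of "symbol + optional count" parts.
    private static readonly Regex FormulaPattern =
        new Regex(@"^(?<part>[A-Z][a-z]*[0-9]*)+\z");

    private static readonly char[] Digits = "0123456789".ToCharArray();

    // Returns (symbol, count) tuples; a real version would build
    // ChemicalFormulaComponent objects here instead (assumption).
    public static List<(string Symbol, int Count)> Parse(string chemicalFormula)
    {
        Match match = FormulaPattern.Match(chemicalFormula);
        if (!match.Success)
            throw new FormatException("Input string was in an incorrect format.");

        var formula = new List<(string Symbol, int Count)>();
        // A repeated group keeps every capture, one per element in the formula.
        foreach (Capture part in match.Groups["part"].Captures)
        {
            string symbol = part.Value.TrimEnd(Digits);          // letters only
            string digits = part.Value.Substring(symbol.Length); // may be empty
            formula.Add((symbol, digits.Length == 0 ? 1 : int.Parse(digits)));
        }
        return formula;
    }
}
```

Calling `RegexFormulaSketch.Parse("H2O")` should give H with a count of 2 followed by O with a count of 1, and an input such as "h2o" should throw a FormatException.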
The problem with your method is here:
// Add the chemical element and its atom count
if (countBuffer > 0)
When an element has no number after it, countBuffer is still 0, so that element is never added. I think this will work:
// Add the chemical element and its atom count
if (countBuffer > 0 || nameBuffer != String.Empty)
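This will work for formulas like HO2, where an element without a number is followed by another element. Note that countBuffer is still 0 for such an element, so a count of 0 has to be treated as 1 when the component is added.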
I also believe that your method will never insert the last element of the chemical formula into the formula collection, because an element is only added when the next uppercase letter is seen.
You should add the last element in the buffer to the collection before returning the result, like this:
formula.Add(new ChemicalFormulaComponent(ChemicalElement.ElementFromSymbol(nameBuffer), countBuffer));
return formula;
}
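Putting both fixes together (flush on every new symbol, and flush once more after the loop), the hand-written loop could look roughly like the sketch below. It is not the asker's final code: the class name `LoopFormulaSketch` and the `hasDigits` flag are introduced here for illustration, it returns (symbol, count) tuples instead of `ChemicalFormulaComponent` so that it compiles on its own, and it checks plain ASCII ranges rather than using `char.IsUpper` and `char.IsDigit`.

```csharp
using System;
using System.Collections.Generic;

public static class LoopFormulaSketch
{
    // Returns (symbol, count) tuples; swap in ChemicalFormulaComponent
    // and ChemicalElement.ElementFromSymbol in real code (assumption).
    public static List<(string Symbol, int Count)> Parse(string chemicalFormula)
    {
        if (string.IsNullOrEmpty(chemicalFormula))
            throw new FormatException("Input string was in an incorrect format.");

        var formula = new List<(string Symbol, int Count)>();
        string symbol = string.Empty;
        int count = 0;
        bool hasDigits = false; // distinguishes "no number" from an explicit 0

        foreach (char c in chemicalFormula)
        {
            if (c >= 'A' && c <= 'Z')
            {
                // A new symbol starts: flush the previous element, if there is one.
                if (symbol.Length > 0)
                    formula.Add((symbol, hasDigits ? count : 1));
                symbol = c.ToString();
                count = 0;
                hasDigits = false;
            }
            else if (c >= 'a' && c <= 'z' && symbol.Length > 0 && !hasDigits)
            {
                // Lowercase letters may only extend a symbol, never follow a count.
                symbol += c;
            }
            else if (c >= '0' && c <= '9' && symbol.Length > 0)
            {
                count = (count * 10) + (c - '0');
                hasDigits = true;
            }
            else
            {
                throw new FormatException("Input string was in an incorrect format.");
            }
        }

        // Flush the last element; the loop only flushes when a new symbol begins.
        formula.Add((symbol, hasDigits ? count : 1));
        return formula;
    }
}
```

With this version, `LoopFormulaSketch.Parse("HO2")` should yield H with a count of 1 and O with a count of 2, and the last element is no longer dropped.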
First of all: I haven't used a parser generator in .NET, but I'm pretty sure you could find something appropriate. This would allow you to write the grammar of chemical formulas in a far more readable form. See for example this question for a first start.
If you want to keep your approach: is it possible that you never add your last element, whether it has a number or not? You might want to run your loop with i <= chemicalFormula.Length and, in the case of i == chemicalFormula.Length, also add what you have to your formula. You then also have to remove your if (countBuffer > 0) condition, because countBuffer can actually be zero!