Parsing a chemical formula from a string in C#? [duplicate]

I am trying to parse a chemical formula (in a format such as Al2O3, O3, C, or C11H22O12) from a string in C#. It works fine unless there is only one atom of a particular element (e.g. the oxygen atom in H2O). How can I fix that problem, and is there a better way to parse a chemical formula string than the one I am using?

ChemicalElement is a class representing a chemical element. It has properties AtomicNumber (int), Name (string), Symbol (string). ChemicalFormulaComponent is a class representing a chemical element and atom count (e.g. part of a formula). It has properties Element (ChemicalElement), AtomCount (int).

The rest should be clear enough to understand (I hope) but please let me know with a comment if I can clarify anything, before you answer.

Here is my current code:

    /// <summary>
    /// Parses a chemical formula from a string.
    /// </summary>
    /// <param name="chemicalFormula">The string to parse.</param>
    /// <exception cref="FormatException">The chemical formula was in an invalid format.</exception>
    public static Collection<ChemicalFormulaComponent> FormulaFromString(string chemicalFormula)
    {
        Collection<ChemicalFormulaComponent> formula = new Collection<ChemicalFormulaComponent>();

        string nameBuffer = string.Empty;
        int countBuffer = 0;

        for (int i = 0; i < chemicalFormula.Length; i++)
        {
            char c = chemicalFormula[i];

            if (!char.IsLetterOrDigit(c) || !char.IsUpper(chemicalFormula, 0))
            {
                throw new FormatException("Input string was in an incorrect format.");
            }
            else if (char.IsUpper(c))
            {
                // Add the chemical element and its atom count
                if (countBuffer > 0)
                {
                    formula.Add(new ChemicalFormulaComponent(ChemicalElement.ElementFromSymbol(nameBuffer), countBuffer));

                    // Reset
                    nameBuffer = string.Empty;
                    countBuffer = 0;
                }

                nameBuffer += c;
            }
            else if (char.IsLower(c))
            {
                nameBuffer += c;
            }
            else if (char.IsDigit(c))
            {
                if (countBuffer == 0)
                {
                    countBuffer = c - '0';
                }
                else
                {
                    countBuffer = (countBuffer * 10) + (c - '0');
                }
            }
        }

        return formula;
    }

Jake Petroules asked Nov 07 '10 06:11


3 Answers

I rewrote your parser using regular expressions. Regular expressions fit the bill perfectly for what you're doing. Hope this helps.

public static void Main(string[] args)
{
    var testCases = new List<string>
    {
        "C11H22O12",
        "Al2O3",
        "O3",
        "C",
        "H2O"
    };

    foreach (string testCase in testCases)
    {
        Console.WriteLine("Testing {0}", testCase);

        var formula = FormulaFromString(testCase);

        foreach (var element in formula)
        {
            Console.WriteLine("{0} : {1}", element.Element.Symbol, element.AtomCount);
        }
        Console.WriteLine();
    }

    /* Produced the following output

    Testing C11H22O12
    C : 11
    H : 22
    O : 12

    Testing Al2O3
    Al : 2
    O : 3

    Testing O3
    O : 3

    Testing C
    C : 1

    Testing H2O
    H : 2
    O : 1
    */
}

private static Collection<ChemicalFormulaComponent> FormulaFromString(string chemicalFormula)
{
    Collection<ChemicalFormulaComponent> formula = new Collection<ChemicalFormulaComponent>();
    string elementRegex = "([A-Z][a-z]*)([0-9]*)";
    string validateRegex = "^(" + elementRegex + ")+$";

    if (!Regex.IsMatch(chemicalFormula, validateRegex))
        throw new FormatException("Input string was in an incorrect format.");

    foreach (Match match in Regex.Matches(chemicalFormula, elementRegex))
    {
        string name = match.Groups[1].Value;

        int count =
            match.Groups[2].Value != "" ?
            int.Parse(match.Groups[2].Value) :
            1;

        formula.Add(new ChemicalFormulaComponent(ChemicalElement.ElementFromSymbol(name), count));
    }

    return formula;
}
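
As a variation on the above, validation and extraction can probably be folded into a single match, because .NET keeps every capture of a repeated group. This is only a sketch: `FormulaRegexSketch` is a made-up name, and it returns plain (symbol, count) pairs as a stand-in for `ChemicalFormulaComponent`, since those classes are not shown in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class FormulaRegexSketch
{
    // One anchored pattern: a repeated (symbol, optional count) pair.
    private static readonly Regex FormulaPattern =
        new Regex(@"^(?:(?<symbol>[A-Z][a-z]*)(?<count>[0-9]*))+$");

    // Returns (symbol, count) pairs; KeyValuePair stands in for ChemicalFormulaComponent.
    public static List<KeyValuePair<string, int>> Parse(string chemicalFormula)
    {
        Match match = FormulaPattern.Match(chemicalFormula);
        if (!match.Success)
            throw new FormatException("Input string was in an incorrect format.");

        // Each repetition of the outer group records one capture per named group,
        // so the two collections line up index by index.
        CaptureCollection symbols = match.Groups["symbol"].Captures;
        CaptureCollection counts = match.Groups["count"].Captures;

        var result = new List<KeyValuePair<string, int>>();
        for (int i = 0; i < symbols.Count; i++)
        {
            string digits = counts[i].Value;
            // No digits after the symbol means a single atom.
            int count = digits.Length == 0 ? 1 : int.Parse(digits);
            result.Add(new KeyValuePair<string, int>(symbols[i].Value, count));
        }

        return result;
    }
}
```

Calling `FormulaRegexSketch.Parse("H2O")` should give the pairs H with 2 and O with 1. To get back to the original types, each symbol would still need to go through `ChemicalElement.ElementFromSymbol`.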

Pieter van Ginkel answered Oct 22 '22 13:10


The problem with your method is here:

            // Add the chemical element and its atom count
            if (countBuffer > 0)

When an element has no number, countBuffer is still 0, so that element is never added. I think this will work:

            // Add the chemical element and its atom count
            if (countBuffer > 0 || nameBuffer != String.Empty)

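This will work for formulas like HO2, where an element has no explicit count. Note that countBuffer is 0 for such an element, so the count you store should default to 1. I also believe that your method never inserts the last element of the chemical formula into the formula collection.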

You should add the last element in the buffer to the collection before returning the result, like this:

    // A count of 0 means no number was given, which stands for one atom
    formula.Add(new ChemicalFormulaComponent(ChemicalElement.ElementFromSymbol(nameBuffer), countBuffer == 0 ? 1 : countBuffer));

    return formula;
}
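
Putting both fixes together, the loop might end up looking roughly like the sketch below. It is not your exact method: `FormulaLoopSketch` and `MakeComponent` are names I made up, plain (symbol, count) pairs stand in for `ChemicalFormulaComponent`, and I check ASCII ranges directly instead of using `char.IsUpper` and friends, which also accept non-ASCII letters and digits.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper illustrating the two fixes; not the asker's actual class.
public static class FormulaLoopSketch
{
    public static List<KeyValuePair<string, int>> Parse(string chemicalFormula)
    {
        // The formula must start with an uppercase ASCII letter.
        if (string.IsNullOrEmpty(chemicalFormula) ||
            chemicalFormula[0] < 'A' || chemicalFormula[0] > 'Z')
        {
            throw new FormatException("Input string was in an incorrect format.");
        }

        var formula = new List<KeyValuePair<string, int>>();
        string nameBuffer = string.Empty;
        int countBuffer = 0;

        foreach (char c in chemicalFormula)
        {
            if (c >= 'A' && c <= 'Z')
            {
                // Fix 1: flush whenever a symbol is buffered, even if no number followed it.
                if (nameBuffer.Length > 0)
                {
                    formula.Add(MakeComponent(nameBuffer, countBuffer));
                }

                nameBuffer = c.ToString();
                countBuffer = 0;
            }
            else if (c >= 'a' && c <= 'z' && countBuffer == 0)
            {
                // Lowercase letters extend the current symbol; after a non-zero count they are rejected below.
                nameBuffer += c;
            }
            else if (c >= '0' && c <= '9')
            {
                countBuffer = (countBuffer * 10) + (c - '0');
            }
            else
            {
                throw new FormatException("Input string was in an incorrect format.");
            }
        }

        // Fix 2: the last element is still in the buffers when the loop ends.
        formula.Add(MakeComponent(nameBuffer, countBuffer));
        return formula;
    }

    // A count of 0 means no digits were seen (an explicit 0 is treated the same way), so use 1.
    private static KeyValuePair<string, int> MakeComponent(string symbol, int count)
    {
        return new KeyValuePair<string, int>(symbol, count == 0 ? 1 : count);
    }
}
```

With this, `FormulaLoopSketch.Parse("H2O")` should give H with 2 followed by O with 1, and `"HO2"` should give H with 1 followed by O with 2.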

Edgar Hernandez answered Oct 22 '22 12:10


First of all: I haven't used a parser generator in .NET, but I'm pretty sure you could find an appropriate one. It would let you write the grammar of chemical formulas in a far more readable form. See, for example, this question for a start.

If you want to keep your approach: is it possible that you never add your last element, whether or not it has a number? You might want to run your loop with i <= chemicalFormula.Length and, when i == chemicalFormula.Length, also add what you have buffered to your formula. You then also have to remove your if (countBuffer > 0) condition, because countBuffer can actually be zero when an element has no explicit number!
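
To make the "one step past the end" idea concrete, here is a rough sketch. I have not run it against your classes: `FormulaSentinelSketch` is a made-up name, the result is a list of plain (symbol, count) pairs rather than `ChemicalFormulaComponent`, and the digits are collected as a string so that a missing number can be told apart from a zero.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper illustrating the extra loop step; not the asker's actual class.
public static class FormulaSentinelSketch
{
    public static List<KeyValuePair<string, int>> Parse(string chemicalFormula)
    {
        if (string.IsNullOrEmpty(chemicalFormula))
            throw new FormatException("Input string was in an incorrect format.");

        var formula = new List<KeyValuePair<string, int>>();
        string name = string.Empty;
        string digits = string.Empty;

        // Note the <=: the final step sees no character and only flushes the last element.
        for (int i = 0; i <= chemicalFormula.Length; i++)
        {
            bool atEnd = i == chemicalFormula.Length;
            char c = atEnd ? '\0' : chemicalFormula[i];

            if (atEnd || (c >= 'A' && c <= 'Z'))
            {
                // End of input and a new symbol both close the current element.
                if (name.Length > 0)
                {
                    int count = digits.Length == 0 ? 1 : int.Parse(digits);
                    formula.Add(new KeyValuePair<string, int>(name, count));
                }

                name = atEnd ? string.Empty : c.ToString();
                digits = string.Empty;
            }
            else if (c >= 'a' && c <= 'z' && name.Length > 0 && digits.Length == 0)
            {
                name += c;
            }
            else if (c >= '0' && c <= '9' && name.Length > 0)
            {
                digits += c;
            }
            else
            {
                throw new FormatException("Input string was in an incorrect format.");
            }
        }

        return formula;
    }
}
```

The point of the design is that the "add what you have" code exists in exactly one place, so the last element cannot be forgotten.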

Philipp answered Oct 22 '22 12:10