
Chemical formula parser C++

Tags: c++

I am currently working on a program that can parse a chemical formula and return molecular weight and percent composition. The following code works very well with compounds such as H2O, LiOH, CaCO3, and even C12H22O11. However, it cannot handle compounds with polyatomic ions enclosed in parentheses, such as (NH4)2SO4.

I am not necessarily looking for someone to write the program for me, just a few tips on how I might accomplish such a task.

Currently, the program iterates through the input string, raw_molecule, first finding each element's atomic number and storing it in a vector (I use a map<string, int> to map element names to atomic numbers). It then makes a second pass to find the quantity of each element.

bool Compound::parseString() {
map<string,int>::const_iterator search;
string s_temp;
int i_temp;

for (size_t i=0; i<raw_molecule.length(); i++) { // Stop before length(); the last valid index is length()-1
    if ((isupper(raw_molecule[i]))&&(i==0))
        s_temp=raw_molecule[i];
    else if(isupper(raw_molecule[i])&&(i!=0)) {
        // New element- so, convert s_temp to atomic # then store in v_Elements
        search=ATOMIC_NUMBER.find (s_temp);
        if (search==ATOMIC_NUMBER.end()) 
            return false;// There is a problem
        else
            v_Elements.push_back(search->second); // Add atomic number into vector

        s_temp=raw_molecule[i]; // Replace temp with the new element

    }
    else if(islower(raw_molecule[i]))
        s_temp+=raw_molecule[i]; // E.g. N+=a which means temp=="Na"
    else
        continue; // It is a number/parentheses or something
}
// Whatever's in temp must be converted to atomic number and stored in vector
search=ATOMIC_NUMBER.find (s_temp);
if (search==ATOMIC_NUMBER.end()) 
    return false;// There is a problem
else
    v_Elements.push_back(search->second); // Add atomic number into vector

// --- Find quantities next --- // 
for (size_t i=0; i<raw_molecule.length(); i++) {
    if (isdigit(raw_molecule[i])) {
        if (toInt(raw_molecule[i])==0)
            return false; // A quantity may not start with 0
        // Read the whole run of digits once, so numbers of any length (10, 22, 123) work
        i_temp=0;
        while (i<raw_molecule.length() && isdigit(raw_molecule[i])) {
            i_temp=(i_temp*10)+toInt(raw_molecule[i]);
            i++;
        }
        i--; // Step back one; the for loop advances past the last digit
        v_Quantities.push_back(i_temp); // This will not work for polyatomic ions
    }
    else if(i<(raw_molecule.length()-1)) {
        if (isupper(raw_molecule[i+1])) {
            v_Quantities.push_back(1);
        }
    }
    // If there is no number, there is only 1 atom. Between O and N for example: O is upper, N is upper, O has 1.
    else if(i==(raw_molecule.length()-1)) {
        if (isalpha(raw_molecule[i]))
            v_Quantities.push_back(1);
    }
}

return true;
}

This is my first post, so if I have included too little (or maybe too much) information, please forgive me.

ad2476 asked Dec 09 '22 01:12



1 Answer

While you might be able to write an ad-hoc scanner that handles one level of parens, the canonical technique for things like this is to write a real parser.

There are two common ways to do that:

  1. Recursive descent
  2. Machine-generated bottom-up parser based on a grammar-specification file.

(And technically, there is a third category, PEG, that is machine-generated-top-down.)

Anyway, for case 1, you make a recursive call to your parser when you see a ( token and return from that level of recursion when you reach the matching ) token.

Typically a parser builds a tree-like internal representation called a syntax tree, but in your case you can probably skip that: have the recursive call return the weight of the parenthesized group, multiply it by the count that follows the ), and add the result to the running total of the level that made the call.
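To make the recursive-descent idea concrete, here is a minimal sketch of what it could look like. It is not taken from the question's Compound class: the names Counts, parseCount, parseGroup and parseFormula are hypothetical, and instead of returning a weight directly it returns a count per element symbol, on the assumption that both molecular weight and percent composition can be computed from those counts afterwards.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical type: element symbol -> number of atoms.
using Counts = std::map<std::string, int>;

// Reads an optional run of digits at s[pos]; a missing number means 1.
static int parseCount(const std::string& s, std::size_t& pos) {
    if (pos >= s.size() || !std::isdigit(static_cast<unsigned char>(s[pos])))
        return 1;
    int n = 0;
    while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos])))
        n = n * 10 + (s[pos++] - '0');
    return n;
}

// Parses elements and parenthesized groups until ')' or end of input.
static Counts parseGroup(const std::string& s, std::size_t& pos) {
    Counts total;
    while (pos < s.size() && s[pos] != ')') {
        Counts part;
        if (s[pos] == '(') {
            ++pos;                        // consume '('
            part = parseGroup(s, pos);    // recurse into the group
            if (pos >= s.size() || s[pos] != ')')
                throw std::invalid_argument("missing ')'");
            ++pos;                        // consume ')'
        } else if (std::isupper(static_cast<unsigned char>(s[pos]))) {
            // An element symbol: one uppercase letter plus any lowercase letters.
            std::string symbol(1, s[pos++]);
            while (pos < s.size() && std::islower(static_cast<unsigned char>(s[pos])))
                symbol += s[pos++];
            part[symbol] = 1;
        } else {
            throw std::invalid_argument("unexpected character");
        }
        // The number after an element or a ')' multiplies everything in part.
        int multiplier = parseCount(s, pos);
        for (const auto& entry : part)
            total[entry.first] += entry.second * multiplier;
    }
    return total;
}

// Entry point: parses the whole string and rejects a stray ')'.
Counts parseFormula(const std::string& s) {
    std::size_t pos = 0;
    Counts result = parseGroup(s, pos);
    if (pos != s.size())
        throw std::invalid_argument("unmatched ')'");
    return result;
}
```

With this sketch, parseFormula("(NH4)2SO4") yields N: 2, H: 8, S: 1, O: 4. Because parseGroup calls itself on every (, nested groups need no extra code. The sketch does not check that a symbol is a real element; a lookup in something like the question's ATOMIC_NUMBER map would still be needed for that.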

For case 2, you write a grammar-specification file and use a tool like yacc to turn that grammar into a parser.

DigitalRoss answered Jan 04 '23 22:01
