Extracting Info from Plain Text and Writing to XML Using DOM

I'm currently building format conversion tools for glycobiology. The conversion goes from a plain text file to an XML format that is standard in the field. Most of the data we receive carries the information of interest in plain text like the sample below; the actual file has all of this on one line. Reading and splitting the text is straightforward (if not exactly intuitive), but generating the XML is where I'm stuck.

[][b-D-GlcpNAc]
    {[(4+1)][b-D-GlcpNAc]
        {[(4+1)][b-D-Manp]
            {[(3+1)][a-D-Manp]
                {[(2+1)][a-D-Manp]{}
            }
        [(6+1)][a-D-Manp]
            {[(3+1)][a-D-Manp]{}
            [(6+1)][a-D-Manp]{}
        }
    }
}

How to interpret this:

  1. Everything of the form w-w-w+ is a sugar that is linked to another one. An opening curly { starts the list of sugars linked to it.
  2. 4+1, 3+1 and so on indicate which carbons bond the two sugars. In 4+1, the 4th carbon of the preceding sugar links to the 1st carbon of the succeeding one.
  3. {} indicates that no additional sugar is linked to that sugar.
  4. A closing curly } just closes that tier.
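
Rules 1 and 2 can be captured with a regular expression. The following is only a sketch in Python: the function name `read_sugars` and the field names (anomer, config, sugar) are made up for illustration, and it assumes every residue is written exactly as `[linkage][w-w-w+]`, with an empty linkage `[]` for the first sugar.

```python
import re

# [linkage][w-w-w+]; the "(4+1)" part is optional because the first sugar has "[]".
SUGAR = re.compile(r"\[(?:\((\d+)\+(\d+)\))?\]\[(\w)-(\w)-(\w+)\]")

def read_sugars(text):
    """Yield (parent_carbon, child_carbon, anomer, config, sugar) per residue."""
    for match in SUGAR.finditer(text):
        parent_c, child_c, anomer, config, sugar = match.groups()
        yield (
            int(parent_c) if parent_c else None,  # carbon on the preceding sugar
            int(child_c) if child_c else None,    # carbon on this sugar
            anomer,
            config,
            sugar,
        )

for fields in read_sugars("[][b-D-GlcpNAc]{[(4+1)][b-D-Manp]{}}"):
    print(fields)
```

This only reads the residues in order; it ignores the curly braces, so it does not yet say which sugar each one is attached to.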

You can probably read the XML and figure out how the linkages work, but if you'd prefer a more detailed explanation, just ask.

What the XML should look like is shown below.

<?xml version="1.0" encoding="UTF-8"?>
<GlydeII>
    <molecule subtype="glycan" id="From_GlycoCT_Translation">
            <residue subtype="base_type" partid="1" ref="http://www.monosaccharideDB.org/GLYDE-II.jsp?G=b-dglc-HEX-1:5" />
            <residue subtype="substituent" partid="2" ref="http://www.monosaccharideDB.org/GLYDE-II.jsp?G=n-acetyl" />
            <residue subtype="base_type" partid="3" ref="http://www.monosaccharideDB.org/GLYDE-II.jsp?G=b-dglc-HEX-1:5" />
            <residue subtype="substituent" partid="4" ref="http://www.monosaccharideDB.org/GLYDE-II.jsp?G=n-acetyl" />
            <residue subtype="base_type" partid="5" ref="http://www.monosaccharideDB.org/GLYDE-II.jsp?G=b-dman-HEX-1:5" />
            <residue subtype="base_type" partid="6" ref="http://www.monosaccharideDB.org/GLYDE-II.jsp?G=a-dman-HEX-1:5" />
            <residue subtype="base_type" partid="7" ref="http://www.monosaccharideDB.org/GLYDE-II.jsp?G=a-dman-HEX-1:5" />
            <residue subtype="base_type" partid="8" ref="http://www.monosaccharideDB.org/GLYDE-II.jsp?G=a-dman-HEX-1:5" />
            <residue subtype="base_type" partid="9" ref="http://www.monosaccharideDB.org/GLYDE-II.jsp?G=a-dman-HEX-1:5" />
            <residue subtype="base_type" partid="10" ref="http://www.monosaccharideDB.org/GLYDE-II.jsp?G=a-dman-HEX-1:5" />
            <residue_link from="2" to="1">
                <atom_link from="N1H" to="C2" to_replace="O2" bond_order="1" />
            </residue_link>
            <residue_link from="3" to="1">
                <atom_link from="C1" to="O4" from_replace="O1" bond_order="1" />
            </residue_link>
            <residue_link from="4" to="3">
                <atom_link from="N1H" to="C2" to_replace="O2" bond_order="1" />
            </residue_link>
            <residue_link from="5" to="3">
                <atom_link from="C1" to="O4" from_replace="O1" bond_order="1" />
            </residue_link>
            <residue_link from="6" to="5">
                <atom_link from="C1" to="O3" from_replace="O1" bond_order="1" />
            </residue_link>
            <residue_link from="7" to="6">
                <atom_link from="C1" to="O2" from_replace="O1" bond_order="1" />
            </residue_link>
            <residue_link from="8" to="5">
                <atom_link from="C1" to="O6" from_replace="O1" bond_order="1" />
            </residue_link>
            <residue_link from="9" to="8">
                <atom_link from="C1" to="O3" from_replace="O1" bond_order="1" />
            </residue_link>
            <residue_link from="10" to="8">
                <atom_link from="C1" to="O6" from_replace="O1" bond_order="1" />
            </residue_link>
    </molecule>
</GlydeII>

So far I've been able to extract all the residue fields and write them to XML without trouble, but I'm struggling to write even pseudocode for the residue_link fields. Any help or ideas on how to add the linkage information to the XML would be appreciated.

asked Nov 14 '22 20:11 by arkestra


1 Answer

Okay! Cool problem, it hurts my brain in a good way.

First, for my own sanity, I re-indented your raw data so the nesting is easier to see:

[][b-D-GlcpNAc] {
    [(4+1)][b-D-GlcpNAc] {
        [(4+1)][b-D-Manp] {
            [(3+1)][a-D-Manp] {
                [(2+1)][a-D-Manp] { }
            }
            [(6+1)][a-D-Manp] {
                [(3+1)][a-D-Manp] { }
                [(6+1)][a-D-Manp] { }
            }
        }
    }

I think the key is to work out the parent/child pairs, and for that you need to track programmatically which nesting level you're on.

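As a small Python function (the name sugar_levels is just illustrative):

def sugar_levels(text):
    hierarchy = 0
    sugars = []  # (nesting level, sugar name) in input order
    for i, char in enumerate(text):
        if char == "{":
            hierarchy += 1
        elif char == "}":
            hierarchy -= 1
        elif char == "[":
            token = text[i + 1:text.index("]", i)]
            # Skip linkage brackets such as [(4+1)] and the empty root [].
            if token and not token.startswith("("):
                sugars.append((hierarchy, token))
    return sugars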

You'd also want to keep track of which sugar is the current "parent" at each level, so that every new sugar can be paired with it.
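
One way to track the parent is a stack: push the most recent sugar on every `{` and pop on every `}`, so the top of the stack is always the sugar the next one attaches to. The sketch below does that and writes the pairs with `xml.dom.minidom`. Several things in it are assumptions read off the expected XML in the question rather than from any spec: partids are handed out in input order, a name ending in `NAc` gets an extra substituent residue with the next partid, and a linkage `(p+c)` becomes `from="C<c>" to="O<p>" from_replace="O<c>"`. The function names are made up. The sample as posted ends one `}` short, so the parser does not insist on balanced trailing braces.

```python
import re
from xml.dom.minidom import Document

# Either a sugar "[linkage][name]" or a single brace.
TOKEN = re.compile(r"\[(?:\((\d+)\+(\d+)\))?\]\[([^\]]+)\]|([{}])")

def parse_links(text):
    """Return a list of (from_partid, to_partid, atom_link attributes)."""
    links = []
    stack = []      # partid of the sugar that owns each open "{"
    last = None     # partid of the most recently read sugar
    next_id = 1
    for match in TOKEN.finditer(text):
        parent_c, child_c, name, brace = match.groups()
        if brace == "{":
            stack.append(last)
        elif brace == "}":
            if stack:
                stack.pop()
        else:
            partid, next_id = next_id, next_id + 1
            if stack and parent_c:
                # Glycosidic bond to the sugar on top of the stack.
                links.append((partid, stack[-1], {
                    "from": f"C{child_c}", "to": f"O{parent_c}",
                    "from_replace": f"O{child_c}", "bond_order": "1"}))
            if name.endswith("NAc"):
                # Assumption: N-acetyl substituent takes the next partid.
                sub_id, next_id = next_id, next_id + 1
                links.append((sub_id, partid, {
                    "from": "N1H", "to": "C2",
                    "to_replace": "O2", "bond_order": "1"}))
            last = partid
    return links

def links_to_dom(links):
    """Build a GlydeII document holding only the residue_link elements."""
    doc = Document()
    root = doc.createElement("GlydeII")
    doc.appendChild(root)
    molecule = doc.createElement("molecule")
    molecule.setAttribute("subtype", "glycan")
    root.appendChild(molecule)
    for src, dst, atoms in links:
        residue_link = doc.createElement("residue_link")
        residue_link.setAttribute("from", str(src))
        residue_link.setAttribute("to", str(dst))
        atom_link = doc.createElement("atom_link")
        for key, value in atoms.items():
            atom_link.setAttribute(key, value)
        residue_link.appendChild(atom_link)
        molecule.appendChild(residue_link)
    return doc

SAMPLE = (
    "[][b-D-GlcpNAc]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-Manp]"
    "{[(3+1)][a-D-Manp]{[(2+1)][a-D-Manp]{}}"
    "[(6+1)][a-D-Manp]{[(3+1)][a-D-Manp]{}[(6+1)][a-D-Manp]{}}}}"
)

if __name__ == "__main__":
    print(links_to_dom(parse_links(SAMPLE)).toprettyxml(indent="    "))
```

On the sample from the question this produces the from/to pairs 2→1, 3→1, 4→3, 5→3, 6→5, 7→6, 8→5, 9→8 and 10→8, in the same order as the residue_link elements in the expected XML. The residue elements themselves are left out, since the question already has those working.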

answered Dec 18 '22 09:12 by Civilian