How to generate closing nodes in a file that is not a valid xml file?

Tags:

c#

How do I add closing tags for a given node (<sec>) at certain positions in a text file that is not a valid XML file? I know it's a bit confusing, so sample input and the desired output are included below.

Basically, the program should insert </sec> closing tags before the next <sec> node. How many </sec> tags it adds at that position depends on the id attribute of the preceding <sec> node, read as digits separated by a dot (.), as follows:

If the next <sec> node after, say, <sec id="4.5"> is <sec id="5">, then 2 </sec> tags should be added before <sec id="5">.

If the next <sec> node after, say, <sec id="3.2.1.2"> is <sec id="3.4">, then 3 </sec> tags should be added before <sec id="3.4">.

Obviously I cannot use any XML parsing methods for this, since the file is not valid XML. What other way can this be done? I'm clueless at this point. Can anyone help?

Sample input

<?xml version="1.0" encoding="utf-8"?>
<body>
<sec id="sec1">
<title>Introduction</title>
<p>Tuberculosis is associated with high mortality rate although according to the clinical trials that have been documented</p>
<sec id="sec1.2">
<title>Related Work</title>
<p>The main contributions in this study are:
<list list-type="ordered">
<list-item><label>I.</label><p>Introducing SURF features descriptors for TB detection which for our knowledge has not been used in this problem before.</p></list-item>
<list-item><label>II.</label><p>Providing an extensive study of the effect of grid size on the accuracy of the SURF.</p></list-item>
</list></p>
</sec>
<sec id="sec1.3">
<title>Dataset</title>
<p>The dataset used in this work is a standard computerized images database for tuberculosis gathered and organized by National Library of Medicine in collaboration with the Department of Health and Human Services, Montgomery County, Maryland; USA <xref ref-type="bibr" rid="ref15">[15]</xref>. The set contains 138 x-rays, 80 for normal cases and 58 with TB infections. The images are annotated with clinical readings comes in text notes with the database describing age, gender, and diagnoses. The images comes in 12 bits gray levels, PNG format, and size of 4020*4892. The set contains x-ray images information gathered under Montgomery County&#x0027;s Tuberculosis screening program.</p>
<sec id="sec1.3.5">
<sec id="sec1.3.5.2">
<title>Methodologies</title>
<sec id="sec2">
<p>The majority of TB and death cases are in developing countries.</p>
<sec id="sec2.5">
<p>The disordered physiological manifestations associated with TB is diverse and leads to a complex pathological changes in the organs like the lungs.</p>
<sec id="sec2.5.3">
<sec id="sec2.5.3.1">
<p>The complexity and diversity in the pulmonary manifestations are reported to be caused by age.</p>
<sec id="sec2.5.3.1.1">
</sec>
</sec>
</body>

Desired output

<?xml version="1.0" encoding="utf-8"?>
<body>
<sec id="sec1">
<title>Introduction</title>
<p>Tuberculosis is associated with high mortality rate although according to the clinical trials that have been documented</p>
<sec id="sec1.2">
<title>Related Work</title>
<p>The main contributions in this study are:
<list list-type="ordered">
<list-item><label>I.</label><p>Introducing SURF features descriptors for TB detection which for our knowledge has not been used in this problem before.</p></list-item>
<list-item><label>II.</label><p>Providing an extensive study of the effect of grid size on the accuracy of the SURF.</p></list-item>
</list></p>
</sec>
<sec id="sec1.3">
<title>Dataset</title>
<p>The dataset used in this work is a standard computerized images database for tuberculosis gathered and organized by National Library of Medicine in collaboration with the Department of Health and Human Services, Montgomery County, Maryland; USA <xref ref-type="bibr" rid="ref15">[15]</xref>. The set contains 138 x-rays, 80 for normal cases and 58 with TB infections. The images are annotated with clinical readings comes in text notes with the database describing age, gender, and diagnoses. The images comes in 12 bits gray levels, PNG format, and size of 4020*4892. The set contains x-ray images information gathered under Montgomery County&#x0027;s Tuberculosis screening program.</p>
<sec id="sec1.3.5">
<sec id="sec1.3.5.2">
<title>Methodologies</title>
</sec>
</sec>
</sec>
</sec>
<sec id="sec2">
<p>The majority of TB and death cases are in developing countries.</p>
<sec id="sec2.5">
<p>The disordered physiological manifestations associated with TB is diverse and leads to a complex pathological changes in the organs like the lungs.</p>
<sec id="sec2.5.3">
<sec id="sec2.5.3.1">
<p>The complexity and diversity in the pulmonary manifestations are reported to be caused by age.</p>
<sec id="sec2.5.3.1.1">
</sec>
</sec>
</sec>
</sec>
</sec>
</body>
Don_B asked Mar 24 '18 14:03


1 Answer

To accomplish this, I defined one additional method that returns how many closing </sec> tags should be inserted, based on the difference between the two IDs:

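// Requires: using System; and using System.Linq; (for Count)
public static int HowManyClosingTags(string startTagId, string endTagId)
{
   // if IDs are the same, then we don't need any closing tags
   if (startTagId == endTagId)
      return 0;
   // if the following ID is a subsection of the previous one (e.g. "1.3" -> "1.3.5"),
   // then we don't need any closing tags; the "." keeps "1.2" from matching "1.20"
   if (endTagId.StartsWith(startTagId + ".", StringComparison.Ordinal))
      return 0;

   // skip the common prefix, without running past the end of either ID
   int i = 0;
   while (i < startTagId.Length && i < endTagId.Length && startTagId[i] == endTagId[i])
      i++;

   // every remaining level of the previous ID has to be closed
   return startTagId.Substring(i).Count(ch => ch == '.') + 1;
}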
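As a cross-check of the counting rule, the same number can be computed on dot-separated components instead of characters. The following is only a minimal sketch: `ClosingTagDemo` and `ClosingTagsByComponents` are hypothetical names that do not appear in the answer, and it assumes the IDs are distinct and listed in document order.

```csharp
using System;

public static class ClosingTagDemo
{
    // Hypothetical helper (not from the answer): counts the </sec> tags needed
    // between two consecutive <sec> IDs by comparing dot-separated components.
    public static int ClosingTagsByComponents(string previousId, string nextId)
    {
        string[] prev = previousId.Split('.');
        string[] next = nextId.Split('.');

        // Length of the shared leading run of components.
        int common = 0;
        while (common < prev.Length && common < next.Length && prev[common] == next[common])
            common++;

        // Every level of the previous ID below the shared prefix must be closed.
        // This is 0 when nextId is nested inside previousId or equal to it.
        return prev.Length - common;
    }

    public static void Main()
    {
        // The two examples from the question, plus a nested case.
        Console.WriteLine(ClosingTagsByComponents("4.5", "5"));        // 2
        Console.WriteLine(ClosingTagsByComponents("3.2.1.2", "3.4"));  // 3
        Console.WriteLine(ClosingTagsByComponents("1.3", "1.3.5"));    // 0
    }
}
```

Because whole components are compared, IDs such as "1.2" and "1.20" are treated as siblings, and a shared prefix like "sec" in "sec1.3" simply becomes part of the first component.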

I work with the input as a plain string, because it is invalid XML and cannot be loaded as a document (the XmlDocument.Load() method throws an exception on invalid XML). The code therefore uses only basic string operations, and I have included as many comments as I could think of to make it clear. Below is the code:

// Requires: using System.Linq; (for Count) and using System.Xml; (for XmlDocument)
static void Main(string[] args)
{
    string invalidXml = "your invalid XML";
    int closeTagPos = -1;
    int openTagPos = -1;
    string openTagId = "";
    string closeTagId = "";
    int howManyClosingTagsAlready;
    int lastPos;
    int howManyTagsToInsert;
    while (true)
    {
        //find the current opening tag and the next opening tag (closeTagPos marks
        //where closing tags get inserted); break if either one is not found
        if((openTagPos = invalidXml.IndexOf("<sec id=\"sec", openTagPos + 1)) == -1)
            break;
        if((closeTagPos = invalidXml.IndexOf("<sec id=\"sec", openTagPos + 1)) == -1)
            break;
        //get the IDs of both tags; 12 is the length of the prefix <sec id="sec
        openTagId = invalidXml.Substring(
            openTagPos + 12,
            invalidXml.IndexOf('"', openTagPos + 12) - openTagPos - 12
        );
        closeTagId = invalidXml.Substring(
            closeTagPos + 12,
            invalidXml.IndexOf('"', closeTagPos + 12) - closeTagPos - 12
        );
        //count how many tags were already closed
        howManyClosingTagsAlready = 0;
        lastPos = invalidXml.IndexOf("</sec>", openTagPos);
        while (lastPos > -1 && lastPos < closeTagPos)
        {
            howManyClosingTagsAlready++;
            lastPos = invalidXml.IndexOf("</sec>", lastPos + 1);
        }

        howManyTagsToInsert = HowManyClosingTags(openTagId, closeTagId) - howManyClosingTagsAlready;
        for (int i = 0; i < howManyTagsToInsert; i++)
        {
            //insert closing tags
            invalidXml = invalidXml.Insert(closeTagPos, "</sec>");
        }
    }
    //if no <sec> tag was found at all, there is nothing to close
    if (openTagPos == -1)
        return;
    //now we have to close our last "unclosed" tag; in this case
    //</body> is treated as the closing tag, the logic stays the same
    openTagId = invalidXml.Substring(
        openTagPos + 12,
        invalidXml.IndexOf('"', openTagPos + 12) - openTagPos - 12
    );
    closeTagPos = invalidXml.IndexOf("</body>");
    howManyClosingTagsAlready = 0;
    lastPos = invalidXml.IndexOf("</sec>", openTagPos);
    while (lastPos > -1 && lastPos < closeTagPos)
    {
        howManyClosingTagsAlready++;
        lastPos = invalidXml.IndexOf("</sec>", lastPos + 1);
    }

    howManyTagsToInsert = openTagId.Count(ch => ch == '.') + 1 - howManyClosingTagsAlready;

    for (int i = 0; i < howManyTagsToInsert; i++)
    {
        //insert closing tags
        invalidXml = invalidXml.Insert(closeTagPos, "</sec>");
    }

    //the repaired text should now load; LoadXml throws XmlException if it is still not well-formed
    XmlDocument xml = new XmlDocument();
    xml.LoadXml(invalidXml);
}
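The same repair can also be sketched as a single left-to-right pass that keeps a stack of the sections currently open. This is an alternative to the code above, not the answer's method: `SectionCloser` and `CloseSections` are hypothetical names, and the sketch assumes every opening tag has exactly the form <sec id="..."> and that the document ends with </body>, as in the sample input.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SectionCloser
{
    // Hypothetical helper (not from the answer): inserts the missing </sec> tags.
    public static string CloseSections(string text)
    {
        var open = new Stack<string>();  // IDs of the <sec> elements currently open
        // Matches an opening <sec id="...">, an existing </sec>, or </body>.
        var token = new Regex("<sec id=\"([^\"]*)\">|</sec>|</body>");

        return token.Replace(text, m =>
        {
            string inserted = "";
            if (m.Groups[1].Success)
            {
                string id = m.Groups[1].Value;
                // Close every open section that is not an ancestor of the new one.
                while (open.Count > 0 &&
                       !id.StartsWith(open.Peek() + ".", StringComparison.Ordinal))
                {
                    open.Pop();
                    inserted += "</sec>\n";
                }
                open.Push(id);
            }
            else if (m.Value == "</sec>")
            {
                // An existing closing tag closes the innermost open section.
                if (open.Count > 0) open.Pop();
            }
            else
            {
                // </body>: close whatever is still open.
                while (open.Count > 0) { open.Pop(); inserted += "</sec>\n"; }
            }
            return inserted + m.Value;
        });
    }

    public static void Main()
    {
        string input = "<body>\n<sec id=\"sec1\">\n<sec id=\"sec1.2\">\n<sec id=\"sec2\">\n</body>";
        Console.WriteLine(CloseSections(input));
    }
}
```

Tracking the open sections directly means existing </sec> tags need no separate counting step: each one just pops the innermost open section.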
Michał Turczyn answered Oct 05 '22 02:10