How do I add closing tags for a given node (<sec>) at certain positions in a text file that is not a valid XML file? I know this is a bit confusing, so a sample input and the desired output are given below.
Basically, the program should generate </sec> closing tags before the next <sec> node. How many </sec> tags it adds at that position depends on the id attribute of the <sec> node, using the digits separated by a ".", as follows:
if the next <sec> node after, say, <sec id="4.5"> is <sec id="5">, then 2 </sec> tags should be added before <sec id="5">
if the next <sec> node after, say, <sec id="3.2.1.2"> is <sec id="3.4">, then 3 </sec> tags should be added before <sec id="3.4">
Obviously I cannot use any XML parsing methods for this, so how else can it be done? I'm clueless at this point. Can anyone help?
Sample input
<?xml version="1.0" encoding="utf-8"?>
<body>
<sec id="sec1">
<title>Introduction</title>
<p>Tuberculosis is associated with high mortality rate although according to the clinical trials that have been documented</p>
<sec id="sec1.2">
<title>Related Work</title>
<p>The main contributions in this study are:
<list list-type="ordered">
<list-item><label>I.</label><p>Introducing SURF features descriptors for TB detection which for our knowledge has not been used in this problem before.</p></list-item>
<list-item><label>II.</label><p>Providing an extensive study of the effect of grid size on the accuracy of the SURF.</p></list-item>
</list></p>
</sec>
<sec id="sec1.3">
<title>Dataset</title>
<p>The dataset used in this work is a standard computerized images database for tuberculosis gathered and organized by National Library of Medicine in collaboration with the Department of Health and Human Services, Montgomery County, Maryland; USA <xref ref-type="bibr" rid="ref15">[15]</xref>. The set contains 138 x-rays, 80 for normal cases and 58 with TB infections. The images are annotated with clinical readings comes in text notes with the database describing age, gender, and diagnoses. The images comes in 12 bits gray levels, PNG format, and size of 4020*4892. The set contains x-ray images information gathered under Montgomery County's Tuberculosis screening program.</p>
<sec id="sec1.3.5">
<sec id="sec1.3.5.2">
<title>Methodologies</title>
<sec id="sec2">
<p>The majority of TB and death cases are in developing countries.</p>
<sec id="sec2.5">
<p>The disordered physiological manifestations associated with TB is diverse and leads to a complex pathological changes in the organs like the lungs.</p>
<sec id="sec2.5.3">
<sec id="sec2.5.3.1">
<p>The complexity and diversity in the pulmonary manifestations are reported to be caused by age.</p>
<sec id="sec2.5.3.1.1">
</sec>
</sec>
</body>
Desired output
<?xml version="1.0" encoding="utf-8"?>
<body>
<sec id="sec1">
<title>Introduction</title>
<p>Tuberculosis is associated with high mortality rate although according to the clinical trials that have been documented</p>
<sec id="sec1.2">
<title>Related Work</title>
<p>The main contributions in this study are:
<list list-type="ordered">
<list-item><label>I.</label><p>Introducing SURF features descriptors for TB detection which for our knowledge has not been used in this problem before.</p></list-item>
<list-item><label>II.</label><p>Providing an extensive study of the effect of grid size on the accuracy of the SURF.</p></list-item>
</list></p>
</sec>
<sec id="sec1.3">
<title>Dataset</title>
<p>The dataset used in this work is a standard computerized images database for tuberculosis gathered and organized by National Library of Medicine in collaboration with the Department of Health and Human Services, Montgomery County, Maryland; USA <xref ref-type="bibr" rid="ref15">[15]</xref>. The set contains 138 x-rays, 80 for normal cases and 58 with TB infections. The images are annotated with clinical readings comes in text notes with the database describing age, gender, and diagnoses. The images comes in 12 bits gray levels, PNG format, and size of 4020*4892. The set contains x-ray images information gathered under Montgomery County's Tuberculosis screening program.</p>
<sec id="sec1.3.5">
<sec id="sec1.3.5.2">
<title>Methodologies</title>
</sec>
</sec>
</sec>
</sec>
<sec id="sec2">
<p>The majority of TB and death cases are in developing countries.</p>
<sec id="sec2.5">
<p>The disordered physiological manifestations associated with TB is diverse and leads to a complex pathological changes in the organs like the lungs.</p>
<sec id="sec2.5.3">
<sec id="sec2.5.3.1">
<p>The complexity and diversity in the pulmonary manifestations are reported to be caused by age.</p>
<sec id="sec2.5.3.1.1">
</sec>
</sec>
</sec>
</sec>
</sec>
</body>
To accomplish this task, I defined one additional method, which returns how many </sec> closing tags should be inserted, based on the difference between the two IDs:
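// needs: using System; using System.Linq;
public static int HowManyClosingTags(string startTagId, string endTagId)
{
    // if the IDs are the same, then we don't need any closing tags
    if (startTagId == endTagId)
        return 0;
    // if the following ID is a subsection of the previous one (e.g. "1" -> "1.2"),
    // the previous section stays open; the "." keeps "1.2" from matching "1.21"
    if (endTagId.StartsWith(startTagId + ".", StringComparison.Ordinal))
        return 0;
    // skip the common prefix without running past the end of either ID
    int i = 0;
    while (i < startTagId.Length && i < endTagId.Length && startTagId[i] == endTagId[i])
        i++;
    // every level left in the previous ID after the common prefix has to be closed
    return startTagId.Substring(i).Count(ch => ch == '.') + 1;
}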
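As a cross-check of the counting rule, here is a minimal sketch that compares the IDs segment by segment rather than character by character. The names ClosingTagCount and ClosingTagsBySegments are made up for this illustration, and the sketch assumes the IDs are dot-separated with the "sec" prefix already stripped. It is not the method above, only an alternative way to arrive at the same counts for the examples in the question.

```csharp
using System;

public static class ClosingTagCount
{
    // Hypothetical helper: how many </sec> tags must be written before the next <sec>.
    public static int ClosingTagsBySegments(string previousId, string nextId)
    {
        // identical IDs: nothing to close (mirrors the method above)
        if (previousId == nextId)
            return 0;

        string[] prev = previousId.Split('.');
        string[] next = nextId.Split('.');

        // number of leading segments the two IDs share
        int common = 0;
        while (common < prev.Length && common < next.Length && prev[common] == next[common])
            common++;

        // ancestors of the next section stay open; that is at most next.Length - 1 levels
        int keepOpen = Math.Min(common, next.Length - 1);
        return prev.Length - keepOpen;
    }

    public static void Main()
    {
        Console.WriteLine(ClosingTagsBySegments("4.5", "5"));        // 2
        Console.WriteLine(ClosingTagsBySegments("3.2.1.2", "3.4"));  // 3
        Console.WriteLine(ClosingTagsBySegments("1", "1.2"));        // 0
        Console.WriteLine(ClosingTagsBySegments("1.2", "1.3"));      // 1
    }
}
```

The first two calls reproduce the two cases described in the question; the last two cover a subsection (nothing to close) and a sibling section (one tag to close).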
I work with the text as a string, because it is invalid XML and can't be loaded as XML (the XmlDocument.Load() method throws an exception for invalid XML). So I only do basic string operations, which I hope are understandable from the code; I also included as many comments as I could think of to make it clear. Below is the code:
// needs: using System; using System.Linq; using System.Xml;
static void Main(string[] args)
{
    string invalidXml = "your invalid XML";
    // every opening tag looks like <sec id="sec1.2">; the ID starts right after this prefix
    const string openTagPrefix = "<sec id=\"sec";
    int closeTagPos = -1;
    int openTagPos = -1;
    string openTagId = "";
    string closeTagId = "";
    int howManyClosingTagsAlready;
    int lastPos;
    int howManyTagsToInsert;
    int idPos;
    while (true)
    {
        //get indexes of the current opening tag and of the next one
        //(closing tags go right before the next one); break if none is found
        if ((openTagPos = invalidXml.IndexOf(openTagPrefix, openTagPos + 1)) == -1)
            break;
        if ((closeTagPos = invalidXml.IndexOf(openTagPrefix, openTagPos + 1)) == -1)
            break;
        //get the IDs of both tags (the text between the prefix and the closing quote)
        idPos = openTagPos + openTagPrefix.Length;
        openTagId = invalidXml.Substring(idPos, invalidXml.IndexOf('"', idPos) - idPos);
        idPos = closeTagPos + openTagPrefix.Length;
        closeTagId = invalidXml.Substring(idPos, invalidXml.IndexOf('"', idPos) - idPos);
        //count how many tags were already closed between the two opening tags
        howManyClosingTagsAlready = 0;
        lastPos = invalidXml.IndexOf("</sec>", openTagPos);
        while (lastPos > -1 && lastPos < closeTagPos)
        {
            howManyClosingTagsAlready++;
            lastPos = invalidXml.IndexOf("</sec>", lastPos + 1);
        }
        howManyTagsToInsert = HowManyClosingTags(openTagId, closeTagId) - howManyClosingTagsAlready;
        for (int i = 0; i < howManyTagsToInsert; i++)
        {
            //insert closing tags
            invalidXml = invalidXml.Insert(closeTagPos, "</sec>");
        }
    }
    //no <sec> tag at all: nothing to close
    if (openTagPos == -1)
        return;
    //now we have to close our last "unclosed" tag; in this case
    //</body> is treated as the closing tag, the logic stays the same
    idPos = openTagPos + openTagPrefix.Length;
    openTagId = invalidXml.Substring(idPos, invalidXml.IndexOf('"', idPos) - idPos);
    closeTagPos = invalidXml.IndexOf("</body>");
    howManyClosingTagsAlready = 0;
    lastPos = invalidXml.IndexOf("</sec>", openTagPos);
    while (lastPos > -1 && lastPos < closeTagPos)
    {
        howManyClosingTagsAlready++;
        lastPos = invalidXml.IndexOf("</sec>", lastPos + 1);
    }
    //every level of the last ID is still open, minus the tags already closed
    howManyTagsToInsert = openTagId.Count(ch => ch == '.') + 1 - howManyClosingTagsAlready;
    for (int i = 0; i < howManyTagsToInsert; i++)
    {
        //insert closing tags
        invalidXml = invalidXml.Insert(closeTagPos, "</sec>");
    }
    //the text should now be well-formed, so this no longer throws
    XmlDocument xml = new XmlDocument();
    xml.LoadXml(invalidXml);
}
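For comparison, the same repair can be sketched as a single pass that tracks how many <sec> elements are currently open, instead of comparing each pair of neighbouring IDs. This is a different technique from the code above, not a rewrite of it. The names SecCloser and CloseSections are hypothetical, and the sketch assumes that every opening tag has exactly the form <sec id="...">, that the nesting level equals the number of dot-separated parts of the ID, and that sections never skip a level (as in the sample input).

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml;

public static class SecCloser
{
    // Hypothetical helper: inserts the missing </sec> tags in one pass over the text.
    public static string CloseSections(string text)
    {
        // matches an opening <sec id="..."> (ID captured), a </sec>, or </body>
        var token = new Regex("<sec id=\"([^\"]*)\">|</sec>|</body>");
        var result = new StringBuilder();
        int depth = 0;   // number of <sec> elements currently open
        int copied = 0;  // how much of the input has already been copied

        foreach (Match m in token.Matches(text))
        {
            bool isOpen = m.Groups[1].Success;

            // depth that must be reached before this token is written;
            // -1 means "no closing tags needed" (an existing </sec>)
            int targetDepth = -1;
            if (isOpen)
                targetDepth = m.Groups[1].Value.Split('.').Length - 1;
            else if (m.Value == "</body>")
                targetDepth = 0;

            // copy the text between the previous token and this one unchanged
            result.Append(text, copied, m.Index - copied);

            // close every section nested at or below the new element's level
            if (targetDepth >= 0)
                for (; depth > targetDepth; depth--)
                    result.Append("</sec>");

            if (isOpen)
                depth++;
            else if (m.Value == "</sec>")
                depth = Math.Max(0, depth - 1);

            result.Append(m.Value);
            copied = m.Index + m.Length;
        }

        // copy whatever follows the last token
        result.Append(text, copied, text.Length - copied);
        return result.ToString();
    }

    public static void Main()
    {
        string broken = "<body><sec id=\"sec1\"><sec id=\"sec1.2\"></sec><sec id=\"sec1.3\">"
                      + "<sec id=\"sec1.3.5\"><sec id=\"sec2\"><sec id=\"sec2.5\"></sec></body>";
        string repaired = CloseSections(broken);
        Console.WriteLine(repaired);

        // LoadXml throws an XmlException if the result is still not well-formed
        new XmlDocument().LoadXml(repaired);
    }
}
```

Because the open-section count is updated as each existing </sec> is read, closing tags that are already present do not have to be counted separately. The trade-off is that the result depends on the level assumption stated above rather than on the ID values themselves.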