I have the following CDATA inside an XML document:
<![CDATA[ <p xmlns="">Refer to the below: <br/>
</p>
<table xmlns:abc="http://google.com pic.xsd" cellspacing="1" class="c" type="custom" width="100%">
<tbody>
<tr xmlns="">
<th style="text-align: left">Basic offers...</th>
</tr>
<tr xmlns="">
<td style="text-align: left">Faster network</td>
<td style="text-align: left">
<ul>
<li>Session</li>
</ul>
</td>
</tr>
<tr xmlns="">
<td style="text-align: left">capabilities</td>
<td style="text-align: left">
<ul>
<li>Navigation,</li>
<li>message, and</li>
<li>contacts</li>
</ul>
</td>
</tr>
<tr xmlns="">
<td style="text-align: left">Data</td>
<td style="text-align: left">
<p>Here visit google for more info <a href="http://www.google.com" target="_blank"><font color="#0033cc">www.google.com</font></a>.</p>
<p>Remove this href tag <a href="/abc/def/{T}/t/1" target="_blank">Information</a> remove the tag.</p>
</td>
</tr>
</tbody>
</table>
<p xmlns=""><br/>
</p>
]]>
I want to somehow scan for href="/abc/def and remove every <a> tag whose href starts with /abc/def. In the example above, that means removing the <a> tag and leaving just the text "Information" that was inside it. The CDATA can contain more than one <a> tag with an href starting with /abc/def. I am using C# for this application. Can someone please tell me how this can be done? Should I use a regex, or is there a way to do it with XML itself?
This is the regex I am trying:
"<a href=\"/abc/def/.*></a>"
I want to keep the inner text of the <a> tag and remove only the tags themselves, but the above regex is not working.
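Using HtmlAgilityPack:
using System.Linq;

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html); // html holds the text taken from the CDATA section
// Collect the <a> elements whose href starts with /abc/def.
var nodes = doc.DocumentNode
    .Descendants("a")
    .Where(n => n.Attributes.Any(a => a.Name == "href" && a.Value.StartsWith("/abc/def")))
    .ToArray();
foreach (var node in nodes)
{
    // keepGrandChildren = true: the <a> element is dropped but its inner text stays.
    node.ParentNode.RemoveChild(node, true);
}
var newHtml = doc.DocumentNode.InnerHtml;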
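The snippet above assumes that html already holds the text of the CDATA section. A minimal sketch of one way to pull that text out of the XML and write it back, assuming the document is loaded with System.Xml.Linq; the CdataRewriter class, the RewriteCData method, and the transform delegate are hypothetical names used only for illustration:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class CdataRewriter
{
    // Applies `transform` to the text of every CDATA section in the XML
    // and returns the updated document as a string.
    public static string RewriteCData(string xml, Func<string, string> transform)
    {
        var doc = XDocument.Parse(xml);

        // Snapshot the CDATA nodes up front as a precaution before mutating them.
        foreach (var cdata in doc.DescendantNodes().OfType<XCData>().ToList())
        {
            cdata.Value = transform(cdata.Value);
        }

        // DisableFormatting keeps the existing whitespace instead of re-indenting.
        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

The transform delegate is where the HtmlAgilityPack code would go: it receives the CDATA text as html and returns newHtml.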
I'd use HtmlAgilityPack for this task. From its description:
It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).
The task itself is quite simple: select the matching <a> nodes with XPath, remove the tags, and then read back the resulting HTML:
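using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(xml);
// SelectNodes returns null when nothing matches, so check before looping.
var anchors = doc.DocumentNode.SelectNodes("//a[starts-with(@href, '/abc/def')]");
if (anchors != null)
{
    foreach (var anchor in anchors.ToList())
        anchor.ParentNode.RemoveChild(anchor, true); // true keeps the inner text
}
var result = doc.DocumentNode.OuterHtml;
This removes each matching <a> tag and keeps its inner text, which is what you asked for. Note that calling anchor.Remove() instead would delete the inner text along with the tag.
EDIT:
If you want to remove only the href attribute, replace the RemoveChild call with anchor.Attributes["href"].Remove();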
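For the regex route the question asks about, here is a rough sketch rather than a robust solution. It assumes the anchors are simple: the href value is double-quoted and anchors are not nested. AnchorStripper and StripAnchors are hypothetical names. An HTML parser such as HtmlAgilityPack is still the safer choice for irregular markup.

```csharp
using System.Text.RegularExpressions;

public static class AnchorStripper
{
    // Matches <a ... href="/abc/def..." ...>inner</a> and captures the inner content.
    // [^>]* keeps each part of the match inside the opening tag; (.*?) is lazy so the
    // match stops at the first closing </a>.
    private static readonly Regex AbcDefAnchor = new Regex(
        "<a\\b[^>]*\\bhref=\"/abc/def[^\"]*\"[^>]*>(.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Replaces each matching anchor with its inner content only.
    public static string StripAnchors(string html)
    {
        return AbcDefAnchor.Replace(html, "$1");
    }
}
```

The regex in the question fails because it requires </a> to come immediately after a > character, which leaves no room for the inner text; the capture group above is what preserves it.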