I need a regex to find every '<' or '>' that is not part of an XML tag.
Example:
<tag1>W<E><E</tag1>Z<>S
Should find
<><<>
Example:
<tag1>W<E><E</E></tag1>Z<>S
Should find
<<>
So, any '<' or '>' that is not part of a tag should be a hit (yes, we also have self-closing tags, which should be taken into consideration :)
Edit #2: What I want to do in the end is to replace all matches with the html-encoded values.
Edit #3:
So what I want to do is, from a text containing HTML with some additional tags (very few known tags), get every '<' and '>' that is not part of a tag.
Example (the '<' and '>' characters inside the element text are the ones I want to find so I can replace them with their encoded values):
<div>
<a href="link">Link with < characters</a>
<knownTag>Text with character ></knownTag>
<knownTag>Text < again ></knownTag>
<div>
Result should be:
<div>
<a href="link">Link with &lt; characters</a>
<knownTag>Text with character &gt;</knownTag>
<knownTag>Text &lt; again &gt;</knownTag>
<div>
Any idea on how to solve this problem?
This can be done with regex; however, it's not as simple as you suggest. You will need to find the valid tags and process them to make this work. It just so happens that I did this some time ago when writing a fast and lightweight XML/HTML parser. The code is available at:
http://csharptest.net/browse/src/Library/Html/XmlLightParser.cs
http://csharptest.net/browse/src/Library/Html/XmlLightInterfaces.cs
To use the parser, implement the IXmlLightReader interface defined in the latter of the two source files. The following example produces your desired results and also handles several other constructs you did not mention, such as CDATA sections, processing instructions, and DTDs.
using System;
using System.IO;
using System.Web;           // HttpUtility
using CSharpTest.Net.Html;  // assumed namespace of XmlLightParser in the library linked above

class RegexForBadXml
{
    const string Input = "<?xml version=\"1.0\"?>\r\n<div>\r\n\t<a href=\"link\">Link with < characters</a>\r\n\t<knownTag>Text with character > &and other &#BAD; stuff</knownTag>\r\n\t<knownTag>Text < again ></knownTag>\r\n\t<knownTag><![CDATA[ Text < again > ]]></knownTag>\r\n<div>";

    private static void Main()
    {
        var output = new StringWriter();
        XmlLightParser.Parse(Input, XmlLightParser.AttributeFormat.Html, new OutputFormatter(output));
        Console.WriteLine(output.ToString());
    }

    // Writes tags and other markup back unchanged; only text content is re-encoded.
    private class OutputFormatter : IXmlLightReader
    {
        private readonly TextWriter _output;

        public OutputFormatter(TextWriter output)
        {
            _output = output;
        }

        void IXmlLightReader.StartDocument() { }
        void IXmlLightReader.EndDocument() { }

        public void StartTag(XmlTagInfo tag)
        {
            _output.Write(tag.UnparsedTag);
        }

        public void EndTag(XmlTagInfo tag)
        {
            _output.Write(tag.UnparsedTag);
        }

        public void AddText(string content)
        {
            // Decode first so entities that are already valid are not encoded twice.
            _output.Write(HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(content)));
        }

        public void AddComment(string comment)
        {
            _output.Write(comment);
        }

        public void AddCData(string cdata)
        {
            _output.Write(cdata);
        }

        public void AddControl(string cdata)
        {
            _output.Write(cdata);
        }

        public void AddInstruction(string instruction)
        {
            _output.Write(instruction);
        }
    }
}
The preceding program outputs the following results:
<?xml version="1.0"?>
<div>
<a href="link">Link with &lt; characters</a>
<knownTag>Text with character &gt; &amp;and other &amp;#BAD; stuff</knownTag>
<knownTag>Text &lt; again &gt;</knownTag>
<knownTag><![CDATA[ Text < again > ]]></knownTag>
<div>
Note: I added the XML declaration, the CDATA section, and the '&' text for testing only.
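If pulling in the library is not an option, the core idea (leave recognized tags untouched, encode everything between them) can be roughly sketched with a single regex. This is a simplified illustration and not the linked parser: the class name KnownTagEncoder, the tag whitelist, and the pattern are all assumptions made for the example, and it deliberately ignores comments, CDATA sections, and processing instructions.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class KnownTagEncoder
{
    // Hypothetical whitelist; replace it with the tags your documents really use.
    private static readonly string[] KnownTags = { "div", "a", "knownTag" };

    // Matches an opening, closing, or self-closing tag whose name is whitelisted.
    // Attribute values may be double-quoted, single-quoted, or unquoted.
    private static readonly Regex TagPattern = new Regex(
        @"</?(?:" + string.Join("|", KnownTags) + @")\b" +
        @"(?:\s+[\w:-]+(?:\s*=\s*(?:""[^""]*""|'[^']*'|[^\s""'<>]+))?)*\s*/?>",
        RegexOptions.IgnoreCase);

    public static string Encode(string input)
    {
        var result = new StringBuilder();
        int position = 0;
        foreach (Match tag in TagPattern.Matches(input))
        {
            // Text between the previous tag and this one: encode its angle brackets.
            result.Append(EncodeText(input.Substring(position, tag.Index - position)));
            // The recognized tag itself is copied through unchanged.
            result.Append(tag.Value);
            position = tag.Index + tag.Length;
        }
        // Trailing text after the last recognized tag.
        result.Append(EncodeText(input.Substring(position)));
        return result.ToString();
    }

    private static string EncodeText(string text)
    {
        return text.Replace("<", "&lt;").Replace(">", "&gt;");
    }

    public static void Main()
    {
        Console.WriteLine(Encode("<knownTag>Text < again ></knownTag>"));
    }
}
```

Anything that does not match the whitelist pattern, including a well-formed tag with an unknown name, is treated as text and encoded, so the whitelist has to be complete for the input at hand.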
Use one of the methods from this question to remove the HTML tags from the input,
then:
string output = new string(input.ToCharArray().Where(c=> c=='<'||c=='>').ToArray());
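Put together, the two steps might look like the sketch below. The tag-removal step here is only a stand-in, since it depends on which method you pick: the KnownTag pattern, which strips nothing but tag1, and the StrayBracketFinder name are assumptions for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class StrayBracketFinder
{
    // Hypothetical stand-in for the tag-removal step: strips only <tag1>, </tag1>, <tag1/>.
    private static readonly Regex KnownTag = new Regex(@"</?tag1\s*/?>");

    public static string FindStrayBrackets(string input)
    {
        // Step 1: remove the recognized tags.
        string withoutTags = KnownTag.Replace(input, string.Empty);
        // Step 2: keep only the angle brackets that are left over.
        return new string(withoutTags.Where(c => c == '<' || c == '>').ToArray());
    }

    public static void Main()
    {
        // First example from the question.
        Console.WriteLine(FindStrayBrackets("<tag1>W<E><E</tag1>Z<>S"));
    }
}
```

Note that this returns the stray characters themselves, not their positions in the original input, so it answers "which brackets are left over" but is not enough on its own to replace them in place.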
Judging from your example, it seems you are not searching XML files as the subject suggests, but rather XML-like files - perhaps files that would be XML if they did not contain the "<" and ">" characters that you are looking for.
But you have not specified the task clearly enough. What should happen, for example, with
<tag1>xxxx</tag2>
or with
<tag1><x a="</tag1>"/></tag1>
Picking up the second case is pretty tough (perhaps impossible) to achieve with regular expressions alone. You need to define the grammar of the input language you want to accept (an extension of XML) and parse it using recursive parsing techniques.
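To make the second case concrete, here is a small sketch comparing two tag patterns on that input. Both patterns and the TagPatternDemo name are illustrative assumptions, not a recommended solution: the naive one ends a tag at the first '>', while the quote-aware one skips over quoted attribute values.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagPatternDemo
{
    // Naive: a tag runs from '<' plus a name up to the next '>'.
    public static readonly Regex Naive = new Regex(@"</?\w+[^>]*>");

    // Quote-aware: a '>' inside a quoted attribute value does not end the tag.
    public static readonly Regex QuoteAware = new Regex(
        @"</?\w+(?:[^>""']|""[^""]*""|'[^']*')*>");

    public static string[] Tags(Regex pattern, string input)
    {
        return pattern.Matches(input).Cast<Match>().Select(m => m.Value).ToArray();
    }

    public static void Main()
    {
        const string input = @"<tag1><x a=""</tag1>""/></tag1>";
        // Naive cuts the <x .../> tag short at the '>' inside the attribute value.
        Console.WriteLine(string.Join(" | ", Tags(Naive, input)));
        // QuoteAware keeps <x a="</tag1>"/> as one tag.
        Console.WriteLine(string.Join(" | ", Tags(QuoteAware, input)));
    }
}
```

Even the quote-aware pattern only recognizes individual tags. It does not check that an opening tag is closed by a tag of the same name, so the first case (tag1 closed by tag2) still goes unnoticed, which is where an actual parser comes in.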