
Regex to strip out greater than > and less than < characters from HTML string ignoring existing tags

Tags: java, html, regex

I don't have much experience with regular expressions, and I have an issue where I need to replace all instances of < and > with &lt; and &gt; but leave the HTML tags intact.

For example:

String string = " <p class=\"anotherClass\"> Here is some text the value is for H<sub>2</sub>O is > 1 and < 100 <p>";
// needs to be converted to:
// " <p class=\"anotherClass\"> Here is some text the value is for H<sub>2</sub>O is &gt; 1 and &lt; 100 <p>"

I have tried some lookahead and lookbehind expressions, but I cannot seem to get any of them to work. For example:

import java.util.regex.Pattern;

String string = " <p class=\"anotherClass\"> Here is some text the value is for H<sub>2</sub>) is > 1 and < 100 <p>";

String reg1 = "<(?=[^>\\/]*<\\/)";

Pattern p1 = Pattern.compile(reg1);

String test = p1.matcher(string).replaceAll("&lt;");

This does not seem to have any effect.

I wondered if anyone else had come across this before or if anyone can give me any guidance?

Megan Eisenbraun asked Oct 20 '22 10:10


2 Answers

If all < and > are only present in their escaped version (&lt; and &gt;) you would be able to match and remove them using regex.

But if they aren't (which seems to be your case), you ultimately can't match them with 100% accuracy using regex alone, because HTML/XML tags can be nested.

Your best bet is an HTML Parser, like jsoup:

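import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupExtractGtLt {
    public static void main(String[] args) {
        String html = "<p class=\"anotherClass\"> Here is some text the value is for H<sub>2</sub>) is > 1 and < 100 <p>";
        // Parse the fragment; jsoup treats the stray > and < as text, not as tags
        Document doc = Jsoup.parseBodyFragment(html);
        // unwrap() drops the <body> wrapper and returns its first child (the first <p>);
        // serializing that element writes the stray characters as &gt; and &lt;
        String parsedHTML = doc.body().unwrap().toString();
        System.out.println(parsedHTML);
    }
}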

Output:

 <p class="anotherClass"> Here is some text the value is for H<sub>2</sub>) is &gt; 1 and &lt; 100 </p>
acdcjunior answered Oct 22 '22 03:10


Using regex alone to "parse" HTML markup comes with some hefty caveats, which many, many folks here on SO have commented on. However, your request is relatively modest.

Naked < symbols between tags can be found with <(?=[^>]*(?:<|$)) and replaced by &lt;.

Naked > symbols between tags can be found with ((?:^|>)[^<]*?)> and replaced by \1&gt; (in Java, where replacement strings refer to groups with $, that is $1&gt;).

Note that both must be done on the whole string (not line by line). That is, ^ must match only the beginning of the string and $ only the end of the string, not line boundaries, and any . must match \n.

Note also that each must be performed multiple times until no results are left, since only one replacement can be made at a time between tags.
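As a rough sketch of the two replacements in Java (the language the question is tagged with), something like the following should work. The class and method names are made up for illustration, and the loop simply repeats both replacements until the string stops changing. Without Pattern.MULTILINE, Java's ^ and $ already anchor to the whole input rather than to individual lines.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not from any library: applies the two regexes above
// repeatedly until a pass makes no further change.
class StrayAngleBrackets {
    // A "<" whose following text reaches another "<" or the end of input
    // without passing a ">" cannot be the start of a tag.
    private static final Pattern STRAY_LT = Pattern.compile("<(?=[^>]*(?:<|$))");

    // A ">" that follows a tag's closing ">" (or the start of input)
    // with no "<" in between cannot be the end of a tag.
    private static final Pattern STRAY_GT = Pattern.compile("((?:^|>)[^<]*?)>");

    static String escapeStray(String html) {
        String previous;
        String current = html;
        do {
            previous = current;
            current = STRAY_LT.matcher(current).replaceAll("&lt;");
            // Java replacement strings use $1, not \1, for the first group
            current = STRAY_GT.matcher(current).replaceAll("$1&gt;");
        } while (!current.equals(previous));
        return current;
    }

    public static void main(String[] args) {
        String html = "<p class=\"anotherClass\"> H<sub>2</sub>O is > 1 and < 100 </p>";
        System.out.println(escapeStray(html));
    }
}
```

The loop is there because the > pattern consumes the text it matches, so a single pass handles only one stray > per run of text between tags.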

Caveats:

  • This only finds and replaces stray < or > symbols between tags, NOT in tags themselves. That means that it will mess up on something like <a href="/link/with/</symbol/in/it">.
  • You should, if practical, have a human check the resulting changes for validity, or at least run it through an automated checker.
  • These regexes are time-expensive, so may not be practical if speed is an issue.
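
To make the first caveat concrete, here is a small Java sketch (the class and method names are only illustrative) that runs the < pattern over the problem tag. The pattern treats the tag's own opening < as stray, because a second < appears before any >, so the tag itself gets broken.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: shows the "<" regex mangling a tag whose
// attribute value contains a "<".
class AttributeCaveatDemo {
    static final Pattern STRAY_LT = Pattern.compile("<(?=[^>]*(?:<|$))");

    static String escapeLt(String html) {
        return STRAY_LT.matcher(html).replaceAll("&lt;");
    }

    public static void main(String[] args) {
        String tag = "<a href=\"/link/with/</symbol/in/it\">";
        // The tag's opening "<" is escaped; the "<" inside the attribute is not.
        System.out.println(escapeLt(tag));
        // prints: &lt;a href="/link/with/</symbol/in/it">
    }
}
```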

To reiterate points made by others, please consider a markup parser instead, if doing any work with untrusted inputs.

Pi Marillion answered Oct 22 '22 03:10