I don't have much experience with regular expressions, and I have an issue where I need to replace all instances of > and < in the text with &gt; and &lt; but leave the HTML tags intact.
For example:
String string =" <p class=\"anotherClass\"> Here is some text the value is for H<sub>2</sub>O is > 1 and < 100 <p>";
// needs to be converted to:
// "<p class=\"anotherClass\"> Here is some text the value is for H<sub>2</sub>O is &gt; 1 and &lt; 100 <p>"
I have tried some lookahead and lookbehind expressions, but I cannot seem to get any of them to work. For example:
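import java.util.regex.Pattern;

String string =" <p class=\"anotherClass\"> Here is some text the value is for H<sub>2</sub>) is > 1 and < 100 <p>";
// Attempt: match a '<' that is followed by "</" with no '>' or '/' in between
String reg1="<(?=[^>\\/]*<\\/)";
Pattern p1 = Pattern.compile(reg1);
String test = p1.matcher(string).replaceAll("&lt;");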
This does not seem to have any effect.
I wondered if anyone else had come across this before or if anyone can give me any guidance?
If all < and > characters in the text are present only in their escaped versions (&lt; and &gt;), you would be able to match and remove them using regex.
But if they aren't (which seems to be your case), ultimately, you can't match with 100% accuracy only using regex due to the nested nature of the HTML/XML tags.
Your best bet is an HTML Parser, like jsoup:
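import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupExtractGtLt {
    public static void main(String[] args) {
        String html = "<p class=\"anotherClass\"> Here is some text the value is for H<sub>2</sub>) is > 1 and < 100 <p>";
        // Parse the fragment into a DOM; stray < and > end up inside text nodes
        Document doc = Jsoup.parseBodyFragment(html);
        // unwrap() removes the <body> wrapper and returns its first child, which is then serialized
        String parsedHTML = doc.body().unwrap().toString();
        // Text nodes are escaped on output, so the naked symbols print as entities
        System.out.println(parsedHTML);
    }
}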
Output:
<p class="anotherClass"> Here is some text the value is for H<sub>2</sub>) is &gt; 1 and &lt; 100 </p>
Using regex alone to "parse" HTML markup comes with some hefty caveats, which many, many folks here on SA have commented on. However, your request is relatively modest.
Naked < symbols between tags can be found with <(?=[^>]*(?:<|$)) and replaced by &lt;.
Naked > symbols between tags can be found with ((?:^|>)[^<]*?)> and replaced by \1&gt;.
Note that both must be applied to the whole string (not line by line). That is, . must match \n, ^ must match the beginning of the string (not the line), and $ must match the end of the string (not the line).
Note also that each must be performed multiple times until no results are left, since only one replacement can be made at a time between tags.
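As a rough illustration, the two patterns above could be applied in Java along the following lines. This is only a sketch under a few assumptions: the class name EscapeNakedAngles and the method escapeNaked are made up for this example, the group reference \1 is written $1 because that is how Java replacement strings refer to groups, and both replacements are simply repeated until the string stops changing. MULTILINE is deliberately not set, so ^ and $ refer to the whole input.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption made for this sketch.
class EscapeNakedAngles {
    // A '<' that reaches another '<' or the end of input before any '>'.
    private static final Pattern NAKED_LT = Pattern.compile("<(?=[^>]*(?:<|$))");
    // A '>' that follows the start of input or a previous '>' with no '<' in between.
    private static final Pattern NAKED_GT = Pattern.compile("((?:^|>)[^<]*?)>");

    // Repeats both replacements until a full pass changes nothing.
    static String escapeNaked(String html) {
        String previous;
        String current = html;
        do {
            previous = current;
            current = NAKED_LT.matcher(current).replaceAll("&lt;");
            // Only one '>' per stretch of text is replaced per pass, hence the loop.
            current = NAKED_GT.matcher(current).replaceAll("$1&gt;");
        } while (!current.equals(previous));
        return current;
    }

    public static void main(String[] args) {
        String html = "<p class=\"anotherClass\"> Here is some text the value is for H<sub>2</sub>O is > 1 and < 100 <p>";
        System.out.println(escapeNaked(html));
    }
}
```

The < pass runs first in each iteration because the > pattern cannot look past an unescaped <. A sequence such as x < y > z looks like a tag to both patterns and is left untouched.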
Caveats:
This only handles < or > symbols between tags, NOT in the tags themselves. That means it will mess up on something like <a href="/link/with/</symbol/in/it">.
To reiterate points made by others, please consider a markup parser instead if you are doing any work with untrusted inputs.
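To make the attribute caveat concrete, here is a small sketch that runs only the < pattern over the link example. The class name AttributeCaveatDemo and the method escapeLt are hypothetical names chosen for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are assumptions for illustration only.
class AttributeCaveatDemo {
    // Same '<' pattern as in the answer: a '<' that reaches another '<' or the end before any '>'.
    private static final Pattern NAKED_LT = Pattern.compile("<(?=[^>]*(?:<|$))");

    static String escapeLt(String html) {
        return NAKED_LT.matcher(html).replaceAll("&lt;");
    }

    public static void main(String[] args) {
        String tag = "<a href=\"/link/with/</symbol/in/it\">";
        // The '<' inside the attribute makes the tag's own opening '<' look naked.
        System.out.println(escapeLt(tag));
        // prints: &lt;a href="/link/with/</symbol/in/it">
    }
}
```

The opening bracket of the tag itself gets escaped, so the tag is broken, which is the kind of failure a markup parser avoids.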