
Is it possible to use replaceAll() with wildcards?

Tags: java, html, string

Good morning. I realize there are a ton of questions out there regarding replace() and replaceAll(), but I haven't seen this one.

What I'm looking to do is parse a string (which contains valid HTML, to a point). After I see the second instance of <p> in the string, I want to remove everything that starts with & and ends with ; until I see the next </p>.

To do the second part, I was hoping to use something along the lines of s.replaceAll("&*;","").

That doesn't work, but hopefully it gets my point across: I am looking to replace anything that starts with & and ends with ;.

Asked by Deslyxia, Jan 16 '23 14:01



2 Answers

You should probably leave the parsing to a DOM parser (see this question). I can almost guarantee you'll have to do this to find text within the <p> tags.

For the replacement logic, String.replaceAll uses regular expressions, which can do the matching you want.

The "wildcard" in regular expressions that you want is the .* expression. Using your example:

String ampStr = "This &escape;String";
String removed = ampStr.replaceAll("&.*;", "");
System.out.println(removed);

This outputs "This String". This is because the . represents any character, and the * means "the preceding element, 0 or more times." So .* basically means "any number of characters." However, feeding it:

"This &escape;String &anotherescape;Extended"

will probably not do what you want: it outputs "This Extended", because .* is greedy and runs from the first & to the last ; in the string. To fix this, you specify exactly what you want to look for instead of the . character. This is done using [^;], which means "any character that's not a semicolon":

String removed = ampStr.replaceAll("&[^;]*;", "");

This can also perform better than &.*?; on strings that do not match, so I highly recommend using this version, especially since not all HTML files will contain a &abc; token and the &.*?; version can become a performance bottleneck in that case.
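To make the difference between the three patterns concrete, here is a small sketch that runs each one against the two-token string from above. The class and method names are made up for illustration; only String.replaceAll comes from the JDK.

```java
public class EntityPatternDemo {
    // Greedy: .* runs to the LAST semicolon in the input.
    static String stripGreedy(String s) {
        return s.replaceAll("&.*;", "");
    }

    // Reluctant: .*? stops at the first semicolon after each ampersand.
    static String stripReluctant(String s) {
        return s.replaceAll("&.*?;", "");
    }

    // Negated class: [^;]* can never cross a semicolon.
    static String stripNegated(String s) {
        return s.replaceAll("&[^;]*;", "");
    }

    public static void main(String[] args) {
        String input = "This &escape;String &anotherescape;Extended";
        System.out.println(stripGreedy(input));    // This Extended
        System.out.println(stripReluctant(input)); // This String Extended
        System.out.println(stripNegated(input));   // This String Extended
    }
}
```

Note that both the reluctant and the negated-class versions will also remove text such as "& b;" in "a & b; c", since neither checks that the token is a real HTML entity.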

Answered by Brian, Jan 29 '23 11:01



The expression you want is:

s = s.replaceAll("&.*?;", ""); // strings are immutable, so keep the returned value

But do you really want to be parsing HTML this way? You may be better off using an XML parser.
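If the markup is simple enough that a parser feels like overkill, the scoping the question asks for (only between the second <p> and the next </p>) could be sketched with plain indexOf calls. This is a rough sketch, not a substitute for a real parser: it assumes bare <p> tags with no attributes and no nesting, and the class and method names are hypothetical.

```java
public class SecondParagraphDemo {
    // Removes &...; tokens only between the second "<p>" and the next "</p>".
    // Assumes plain "<p>" tags with no attributes; returns the input
    // unchanged if that range cannot be found.
    static String stripEntitiesInSecondParagraph(String html) {
        int first = html.indexOf("<p>");
        if (first < 0) return html;
        int second = html.indexOf("<p>", first + 3);
        if (second < 0) return html;
        int start = second + 3;                 // first character after the second <p>
        int end = html.indexOf("</p>", start);  // the next closing tag
        if (end < 0) return html;
        String cleaned = html.substring(start, end).replaceAll("&[^;]*;", "");
        return html.substring(0, start) + cleaned + html.substring(end);
    }

    public static void main(String[] args) {
        String html = "<p>a &amp; b</p><p>c&nbsp;d &lt;e</p><p>f &gt; g</p>";
        // prints <p>a &amp; b</p><p>cd e</p><p>f &gt; g</p>
        System.out.println(stripEntitiesInSecondParagraph(html));
    }
}
```

Anything outside the second paragraph is left untouched, so the entities in the first and third paragraphs survive.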

Answered by Jon Lin, Jan 29 '23 11:01
