
Trying to remove html from text with Java

Tags:

java

html

regex

I've got an ArrayList<String> named fields. I'm trying to strip the HTML from each String using the replaceAll function, but I get the feeling that I'm getting the regex String wrong (I got the 2nd regex here; it is meant to match a generic HTML tag). Can anyone give me some tips on how to correct this?

for(int j = 0; j<fields.size(); j++)    
{
    String k = fields.get(j);
    k.replaceAll("<br>", "\n");
    k.replaceAll("<(\"[^\"]*\"|'[^']*'|[^'\">])*>", "");
    k.replaceAll("&lt;", "<");
    k.replaceAll("&gt;", ">");
    fields.set(j, k);
}
user1724159 asked Mar 10 '26 02:03



1 Answer

Remember that strings are immutable: replaceAll returns a new String and leaves k unchanged, so you want to re-assign k each time you call replaceAll:

String k = fields.get(j);
k = k.replaceAll("<br>", "\n");
...
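
For reference, here is a minimal sketch of the whole loop with that fix applied. It assumes the four replacements from the question are what you want, in that order; the class name HtmlStripper, the helper stripHtml, and the sample strings are made up for illustration and are not from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HtmlStripper {

    // Hypothetical helper: applies the question's four replacements in order.
    // Each call returns a new String, so the result is assigned back to s.
    static String stripHtml(String s) {
        s = s.replaceAll("<br>", "\n");                               // line breaks become newlines
        s = s.replaceAll("<(\"[^\"]*\"|'[^']*'|[^'\">])*>", "");      // drop any remaining tags
        s = s.replaceAll("&lt;", "<");                                // decode the two entities
        s = s.replaceAll("&gt;", ">");
        return s;
    }

    public static void main(String[] args) {
        // Sample data, assumed for the example only.
        List<String> fields = new ArrayList<>(
                Arrays.asList("<p>Hello<br>world</p>", "1 &lt; 2"));

        for (int j = 0; j < fields.size(); j++) {
            fields.set(j, stripHtml(fields.get(j)));
        }
        System.out.println(fields);
    }
}
```

With these sample inputs, the first field ends up as "Hello" and "world" separated by a newline, and the second as "1 < 2". Note that the entities are decoded after the tags are stripped, so text such as &lt;b&gt; survives as the literal characters <b>.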
arshajii answered Mar 12 '26 14:03



