How to remove only html tags from text with Jsoup?

Question

I want to remove ONLY the HTML tags from text with Jsoup. I used the solution from my previous question about Jsoup, but after some testing I discovered that it throws a Java heap OutOfMemoryError for some large HTML documents, though not all of them. For example, it fails on a 2 MB HTML document with 10,000 lines. The exception is thrown on the last line (NOT on Jsoup.parse):

public String StripHtml(String html){
  html = html.replace("&lt;", "<").replace("&gt;", ">");
  String[] tags = getAllStandardHtmlTags;
  Document thing = Jsoup.parse(html);
  for (String tag : tags) {
      for (Element elem : thing.getElementsByTag(tag)) {
          elem.parent().insertChildren(elem.siblingIndex(),elem.childNodes());
          elem.remove();
      }
  }
  return thing.html();
}

Is there a way to fix it?

Stephan · Accepted Answer

Alternatively, you can try Jsoup's cleaning capabilities. The code below removes ALL HTML tags from the HTML string passed to it.

public String StripHtml(String html) {
    return Jsoup.clean(html, Whitelist.none());
}

The whitelist (Whitelist.none()) tells the Jsoup cleaner which tags are allowed. Here, no HTML tags are allowed. Any tag not referenced in the whitelist is removed.
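As a quick illustration of the cleaner in action, here is a minimal sketch. It assumes a jsoup version that still ships org.jsoup.safety.Whitelist (such as the 1.8.3 mentioned in this answer; newer releases rename the class to Safelist). The class name StripHtmlDemo and the sample markup are made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

// Hypothetical demo class, not part of jsoup.
public class StripHtmlDemo {

    // Removes every tag and keeps only the text content.
    static String stripHtml(String html) {
        return Jsoup.clean(html, Whitelist.none());
    }

    public static void main(String[] args) {
        // Sample input invented for illustration.
        String html = "<div><p>Hello <b>world</b></p></div>";
        System.out.println(stripHtml(html));
    }
}
```

Because the cleaner builds a new, filtered document instead of moving nodes around inside the parsed one, it avoids the manual tree surgery used in the question.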

You may also be interested in the other provided whitelists:

Whitelist.basic()
Whitelist.basicWithImages()
Whitelist.none()
Whitelist.relaxed()
Whitelist.simpleText()

These base whitelists can be customized by adding tags (see the addTags method) or by removing tags (see the removeTags method).
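To make the customization concrete, here is a small sketch of both directions: widening Whitelist.none() with addTags and narrowing Whitelist.basic() with removeTags. It again assumes a jsoup version that provides Whitelist with both methods; the class name, method names and sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

// Hypothetical demo class, not part of jsoup.
public class CustomWhitelistDemo {

    // Start from "nothing allowed" and permit only <b> and <em>.
    static String keepEmphasis(String html) {
        Whitelist allowed = Whitelist.none().addTags("b", "em");
        return Jsoup.clean(html, allowed);
    }

    // Start from the basic whitelist and disallow links.
    static String basicWithoutLinks(String html) {
        Whitelist allowed = Whitelist.basic().removeTags("a");
        return Jsoup.clean(html, allowed);
    }

    public static void main(String[] args) {
        // Sample inputs invented for illustration.
        System.out.println(keepEmphasis("<p>Some <b>bold</b> and <i>italic</i></p>"));
        System.out.println(basicWithoutLinks("<a href=\"http://example.com\">link</a> <b>x</b>"));
    }
}
```

In both cases the text inside a removed tag is kept; only the tag itself is dropped.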

If you want to create your own whitelist (be careful!), here is how:

Whitelist myCustomWhitelist = new Whitelist();
myCustomWhitelist.addTags("b", "em", ...);

See details here: Jsoup Whitelists

Jsoup 1.8.3

Tags: java, html, out-of-memory, strip, jsoup

Rougher
