
Stripping HTML tags in Java [duplicate]

Tags: java, html

Is there an existing Java library which provides a method to strip all HTML tags from a String? I'm looking for something equivalent to the strip_tags function in PHP.

I know that I can use a regex, as described in this Stack Overflow question. However, I was curious whether there is already a stripTags() method somewhere in the Apache Commons library that I could use.

Todd asked May 07 '09 02:05



2 Answers

Use jsoup. It's well documented, available on Maven, and after a day spent with several libraries it is the best one I have found. In my opinion, a job like this, converting HTML into plain text, should be possible in one line of code; otherwise the library has failed somehow. So here is the jsoup one-liner (HtmlToPlainText comes from jsoup's org.jsoup.examples package). Nothing like it is possible in Markdown4J or MarkdownJ, and in HtmlCleaner it is painful, taking about 50 lines of code.

String plain = new HtmlToPlainText().getPlainText(Jsoup.parse(html)); 

What you get is real plain text (not just the HTML source code as a String, as with other libraries); it really does a great job. The quality is more or less the same as Markdownify for PHP.

jebbie answered Sep 21 '22 12:09


I found this via Google, and it worked fine for me.

String noHTMLString = htmlString.replaceAll("\\<.*?\\>", ""); 
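
For anyone who wants something closer to a stripTags() helper, here is a minimal sketch of the same regex idea wrapped in a method. The class name TagStripper and the method name stripTags are made up for illustration and do not come from any library. It uses the pattern <[^>]*> rather than <.*?>, because . does not match line breaks by default, so a tag that spans two lines would otherwise be left in place. This only removes things that look like tags: it does not decode entities such as &amp;, it keeps the text inside script and style elements, and it can be fooled by a > inside an attribute value. For untrusted or messy HTML, a real parser is the safer choice.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
class TagStripper {
    // '<', then any run of characters that are not '>', then '>'.
    // [^>] also matches line breaks, so tags split across lines are removed.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Returns the input with everything that looks like a tag removed.
    // Entities (e.g. &amp;) are left untouched.
    static String stripTags(String html) {
        if (html == null) {
            return null;
        }
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>")); // prints "Hello world"
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which is what String.replaceAll() does internally.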

Jakob Alexander Eichler answered Sep 24 '22 12:09