remove html tags from string using java [duplicate]

Tags: java, html, string

I am writing a program that reads emails and separates spam from ham. I read the input with Java's BufferedReader class, and I can already remove unwanted characters such as '(' or '.' using the replaceAll() method. I also want to remove HTML tags, including entities such as &amp;. How can I achieve this?

thanks

EDIT: Thanks for the responses, but I already have a regex. How can I combine both requirements into one? Here is the regex I am using now:

lines.replaceAll("[^a-zA-Z]", " ")

Note: I am reading the lines from a .txt file. Any other suggestions?

Maverick asked Dec 13 '10 19:12

1 Answer

Maybe this will work:

String noHTMLString = htmlString.replaceAll("\\<.*?>","");

It uses regular expressions to remove all HTML tags in a string.

More specifically, it removes all XML-like tags from a string, so <1234> will be removed even though it is not a valid HTML tag. For most purposes, though, it is good enough.
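To combine this with the `[^a-zA-Z]` replacement from the question, one option is to run the steps in order: strip the tags first, then the entities such as &amp;, and only then replace the remaining non-letters. The order matters, because running `[^a-zA-Z]` first would turn `<p>` into ` p ` and `&amp;` into ` amp `, leaving tag and entity names behind as words. The following is only a rough sketch: the class name `HtmlStripper`, the method `clean`, and the entity pattern are my own assumptions, not something from the question, and it uses `<[^>]*>` as a close variant of the regex above.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original question or answer.
final class HtmlStripper {
    // Anything that looks like a tag: '<', any run of non-'>' characters, '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Character references: named (&amp;), decimal (&#38;) or hex (&#x26;).
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);");
    // The regex from the question: everything that is not an ASCII letter.
    private static final Pattern NON_LETTER = Pattern.compile("[^a-zA-Z]");

    static String clean(String line) {
        // 1. Remove tags while '<' and '>' are still present in the text.
        String noTags = TAG.matcher(line).replaceAll(" ");
        // 2. Remove entities while '&' and ';' are still present.
        String noEntities = ENTITY.matcher(noTags).replaceAll(" ");
        // 3. Finally replace every remaining non-letter with a space.
        return NON_LETTER.matcher(noEntities).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Only the words "Buy", "save" and "now" survive, separated by spaces.
        System.out.println(clean("<p>Buy &amp; save (now).</p>"));
    }
}
```

Since the lines come from a BufferedReader one at a time, a tag that is split across two lines will not be matched by this approach. If the emails contain messy real-world HTML, a proper HTML parser is likely to be more robust than any regex.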

Hope this helps.

mishmash answered Oct 19 '22 22:10
