Possible Duplicate:
Java: How to decode HTML character entities in Java like HttpUtility.HtmlDecode?
I need to extract paragraphs (like the title in StackOverflow) from an HTML file.
I can use regular expressions in Java to extract the fields I need, but I have to decode the fields obtained.
EXAMPLE
field extracted:
Paging Lucene&#39;s search results (with &#39; between "Lucene" and "s")
field after decoding:
Paging Lucene's search results
Is there any class in Java that will allow me to decode these HTML entities?
In Java, we can use StringEscapeUtils.escapeHtml4(str) from Apache commons-text to escape HTML characters, and StringEscapeUtils.unescapeHtml4(str) from the same class to decode them. In the past, the StringEscapeUtils class from Apache commons-lang3 was the usual choice, but that class is deprecated as of 3.6.
The HTML entities here are &lt; that is the < character, &quot; that is the double-quote character and &gt; that is the > character. In this example, we unescape obfuscated HTML code. Every character in the input string is an HTML entity and in the output, you get a decoded string made out of English letters.
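Use the methods provided by Apache Commons Lang:
// Commons Lang 2.x package; in commons-text the equivalent call is
// org.apache.commons.text.StringEscapeUtils.unescapeHtml4(...)
import org.apache.commons.lang.StringEscapeUtils;
// ...
// beforeDecoding is the raw field extracted from the HTML file
String afterDecoding = StringEscapeUtils.unescapeHtml(beforeDecoding);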
Do not try to solve everything by regexp.
While you can do some parts with regular expressions, such as replacing entities, the much better approach is to actually use a (robust) HTML parser.
See the question "RegEx match open tags except XHTML self-contained tags" for why the regexp swiss army chainsaw is a bad idea here. Seriously, read that question and its top answer; it is a Stack Overflow highlight!
Chuck Norris can parse HTML with regex.
The bad news is: there is more than one way to encode characters.
https://en.wikipedia.org/wiki/Character_encodings_in_HTML
For example, the character 'λ' can be represented as &lambda;, &#955; or &#x3BB;.
And if you are really unlucky, some web site relies on some browsers' ability to guess character meanings. &#153;, for example, is not valid, yet many browsers will interpret it as ™.
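To make the three spellings of λ concrete, here is a minimal JDK-only sketch (Java 9 or later) that decodes named, decimal and hexadecimal references. The class name EntitySketch, the method decode and the five-entry table of names are assumptions made up for this illustration; HTML defines far more named entities than this, and the sketch does not handle invalid references that browsers guess at. It is only meant to show how many cases even the simple part has, not to replace a library.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of any library.
class EntitySketch {
    // Deliberately tiny table of named entities (assumption: just enough
    // for the examples here). "\u03BB" is the character λ.
    private static final Map<String, String> NAMED = Map.of(
            "lambda", "\u03BB",
            "amp", "&",
            "lt", "<",
            "gt", ">",
            "quot", "\"");

    // Matches &#xHEX; , &#DECIMAL; and &name; references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    static String decode(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                replacement = fromCodePoint(body.substring(2), 16, m.group());
            } else if (body.startsWith("#")) {
                replacement = fromCodePoint(body.substring(1), 10, m.group());
            } else {
                // Names missing from the table are left exactly as they were.
                replacement = NAMED.getOrDefault(body, m.group());
            }
            // quoteReplacement keeps '$' and '\' in the result literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Converts the digits of a numeric reference to a string, or returns
    // the original reference if the number is not a Unicode code point.
    private static String fromCodePoint(String digits, int radix, String original) {
        try {
            int cp = Integer.parseInt(digits, radix);
            return Character.isValidCodePoint(cp)
                    ? new String(Character.toChars(cp))
                    : original;
        } catch (NumberFormatException e) {
            return original; // too many digits to fit in an int
        }
    }

    public static void main(String[] args) {
        // All three spellings decode to the same character.
        System.out.println(decode("&lambda; &#955; &#x3BB;"));
        // Assumes the apostrophe in the question arrived as &#39;.
        System.out.println(decode("Paging Lucene&#39;s search results"));
    }
}
```

Even this small version needs three branches, a lookup table and two fallbacks, and it still knows only five names.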
Clearly it is a good idea to leave this to a dedicated library instead of trying to hack a custom regular expression yourself.
So I strongly recommend: