
How to decode HTML codes using Java? [duplicate]

Possible Duplicate:
Java: How to decode HTML character entities in Java like HttpUtility.HtmlDecode?

I need to extract pieces of text (such as the question title on Stack Overflow) from an HTML file.

I can use regular expressions in Java to extract the fields I need, but I then have to decode the extracted fields.

EXAMPLE

field extracted:

Paging Lucene&#39;s search results

field after decoding:

Paging Lucene's search results

Is there any class in Java that will let me convert these HTML codes?

user asked Dec 06 '12 18:12




2 Answers

Use the methods provided by Apache Commons Lang:

import org.apache.commons.lang.StringEscapeUtils;
// ...
// unescapeHtml resolves named and numeric entities, e.g. &#39; becomes '
String afterDecoding = StringEscapeUtils.unescapeHtml(beforeDecoding);
jlordo answered Sep 21 '22 13:09


Do not try to solve everything with regular expressions.

While a regexp can handle some parts, such as replacing entities, the much better approach is to use a (robust) HTML parser.

See the question "RegEx match open tags except XHTML self-contained tags" for why parsing HTML with the regexp swiss army chainsaw is a bad idea. Seriously, read that question and its top answer; it is a Stack Overflow highlight!

Chuck Norris can parse HTML with regex.

The bad news is: there is more than one way to encode characters.

https://en.wikipedia.org/wiki/Character_encodings_in_HTML

For example, the character 'λ' can be represented as &#955;, &#x03BB; or &#X03bb;
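To make the equivalence of those three spellings concrete, here is a minimal sketch that decodes only numeric character references using the JDK's regex classes. The class and method names (NumericReferenceDemo, decodeNumericReferences) are made up for illustration, and it deliberately ignores named entities such as &amp;, so treat it as a demonstration of the notation rather than a substitute for a library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only: handles numeric references, nothing else.
class NumericReferenceDemo {
    // Group 1 captures hexadecimal digits (&#x03BB; or &#X03bb;), group 2 decimal digits (&#955;).
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decodeNumericReferences(String input) {
        Matcher m = NUMERIC_REF.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the literal text between the previous match and this one.
            out.append(input, last, m.start());
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            if (Character.isValidCodePoint(codePoint)) {
                out.appendCodePoint(codePoint);
            } else {
                // Leave references outside the Unicode range untouched.
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // All three spellings should print the same character.
        for (String s : new String[] {"&#955;", "&#x03BB;", "&#X03bb;"}) {
            System.out.println(s + " -> " + decodeNumericReferences(s));
        }
    }
}
```

Note that this sketch has no idea about named entities or the browser quirks, which is exactly the kind of gap a dedicated library closes.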

And if you are really unlucky, some web site relies on browsers' ability to guess what a character was meant to be. &#153;, for example, is not valid, yet many browsers will interpret it as ™.
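The likely explanation for that guess: 153 (0x99) is an unprintable control code in Unicode, but in the Windows-1252 code page the byte 0x99 is the trademark sign, and browsers remap such references accordingly. A small sketch using only the JDK shows the mapping; the class and method names here are hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical demo: interpret a number in 0..255 as a single Windows-1252 byte.
class Windows1252Demo {
    static String asWindows1252(int number) {
        if (number < 0 || number > 0xFF) {
            throw new IllegalArgumentException("not a single byte: " + number);
        }
        return new String(new byte[] {(byte) number}, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        String decoded = asWindows1252(153);
        // Print the resulting Unicode code point in U+XXXX form.
        System.out.printf("&#153; read as Windows-1252 gives U+%04X%n", decoded.codePointAt(0));
    }
}
```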

Clearly it is a good idea to leave this to a dedicated library instead of trying to hack a custom regular expression yourself.

So I strongly recommend:

  • Feed string into a robust HTML parser
  • Get parsed (and fully decoded) string back
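As a rough sketch of those two steps, the following uses the HTML parser that ships with the JDK (the Swing ParserDelegator), chosen here only to avoid an external dependency. It targets an old HTML dialect, so a dedicated parsing library is probably the better fit for real pages. The class name ParsedTextDemo and the method textOf are assumptions made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: feed HTML into the JDK's bundled parser, collect the text it reports.
class ParsedTextDemo {
    static String textOf(String html) {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // The parser hands over text with entities already resolved.
                // Separate text runs are simply concatenated in this sketch.
                text.append(data);
            }
        };
        try {
            // Third argument true: ignore any charset directive in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(textOf("<h1>Paging Lucene&#39;s search results</h1>"));
    }
}
```

The point of the design is that tag handling and entity decoding happen in one pass inside the parser, so no regular expression ever touches the markup.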
Has QUIT--Anony-Mousse answered Sep 22 '22 13:09