
How to convert from HTML to UTF-8 in java

Tags:

java

html

utf-8

I have an ASCII String containing HTML entities, such as:

 &agrave;
 &uml;
 &ccedil;

I need to replace those entities with the characters they represent, as UTF-8. Is there an easy way to do that in Java?

For example, something where:

 Clazz.method("a&agrave;","UTF-8")

returns "aà"

or something like that?

Llistes Sugra asked May 13 '10 10:05



1 Answer

Take a look at org.apache.commons.lang.StringEscapeUtils.unescapeHtml(...). Apparently it understands all character entities defined in HTML 4.
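With Commons Lang 2.x on the classpath, the call would simply be `StringEscapeUtils.unescapeHtml(input)`; in Commons Lang 3 the corresponding method is named `unescapeHtml4`. To illustrate what such a method does, here is a minimal JDK-only sketch. It is not the library's implementation: the class name `HtmlEntityDecoder`, the method `unescape`, and the tiny entity table are all hypothetical and cover only the entities from the question plus a few common ones. Note that the result is an ordinary Java `String`; UTF-8 only comes into play when the string is encoded to bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of Commons Lang: decodes numeric entities and
// a small, hand-picked subset of named HTML 4 entities.
class HtmlEntityDecoder {

    // Assumed subset: the entities from the question plus a few common ones.
    private static final Map<String, Integer> NAMED = new HashMap<>();
    static {
        NAMED.put("agrave", 0x00E0);
        NAMED.put("uml", 0x00A8);
        NAMED.put("ccedil", 0x00E7);
        NAMED.put("amp", 0x0026);
        NAMED.put("lt", 0x003C);
        NAMED.put("gt", 0x003E);
        NAMED.put("quot", 0x0022);
    }

    // Matches &name;, decimal &#231; and hexadecimal &#xE7; forms.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    static String unescape(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            Integer codePoint;
            try {
                if (body.startsWith("#x") || body.startsWith("#X")) {
                    codePoint = Integer.parseInt(body.substring(2), 16);
                } else if (body.startsWith("#")) {
                    codePoint = Integer.parseInt(body.substring(1));
                } else {
                    codePoint = NAMED.get(body); // null if the name is unknown
                }
            } catch (NumberFormatException e) {
                codePoint = null; // number too large to be a code point
            }
            // Unknown or invalid entities are left in the output untouched.
            String replacement =
                    (codePoint != null && Character.isValidCodePoint(codePoint))
                            ? new String(Character.toChars(codePoint))
                            : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String decoded = unescape("a&agrave;");
        // Encoding to UTF-8 is a separate, explicit step.
        byte[] utf8 = decoded.getBytes(StandardCharsets.UTF_8);
        System.out.println(decoded.length() + " chars, " + utf8.length + " UTF-8 bytes");
        // prints "2 chars, 3 UTF-8 bytes"
    }
}
```

For real input, the library method is the better choice, since it covers the full HTML 4 entity set rather than this handful.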

Stephen C answered Oct 03 '22 19:10