Encoding issue in Java

I have a CSV file that contains both ASCII and non-ASCII (Unicode) characters, say "ÅÔÉA". I am not sure about the encoding of this file, but when I open it in Notepad, it shows "ANSI" as its encoding.

I read the contents of the CSV as UTF-8:

fr = new InputStreamReader(new FileInputStream(fileName),"UTF-8");

But when I store it in the DB, these special characters (all except "A") are not stored properly; they get scrambled.

I want all the characters to be stored properly. Any ideas?

user127377 asked Jun 23 '09 06:06



2 Answers

"ANSI" in Notepad means whatever code page your Windows installation is using. Try ISO-8859-1; it works in most cases.
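To illustrate why the charset passed to InputStreamReader matters, here is a minimal sketch. It assumes the file stores "ÅÔÉA" as single-byte ISO-8859-1 data; the class name CharsetDecodeDemo, the helper decode, and the sample bytes are made up for this example and are not from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetDecodeDemo {

    // Decodes raw bytes through an InputStreamReader, the same kind of
    // reader the question uses, but with a caller-chosen charset.
    static String decode(byte[] raw, Charset charset) {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(raw), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed sample data: "ÅÔÉA" as an ISO-8859-1 file would store it,
        // one byte per character.
        byte[] raw = {(byte) 0xC5, (byte) 0xD4, (byte) 0xC9, 0x41};

        String asLatin1 = decode(raw, StandardCharsets.ISO_8859_1);
        String asUtf8 = decode(raw, StandardCharsets.UTF_8);

        // Compare against the expected text, written with Unicode escapes
        // so the source file's own encoding does not matter.
        System.out.println(asLatin1.equals("\u00C5\u00D4\u00C9A"));
        // The UTF-8 decoder treats the three high bytes as malformed input,
        // so this result differs from the ISO-8859-1 one.
        System.out.println(asUtf8.equals(asLatin1));
    }
}
```

For a real file, the ByteArrayInputStream would be replaced by new FileInputStream(fileName). If the file turns out to be Windows-1252 rather than ISO-8859-1, Charset.forName("windows-1252") is the usual alternative, assuming the JRE in use supports that charset.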

J-16 SDiZ answered Oct 30 '22 08:10


First of all, you need to know the encoding of the file. Open it with a hex editor and check how many bytes a non-ASCII character such as "Å" occupies. If it is only one, then the file is not in UTF-8, but more likely in some ISO-8859 variant or a similar Windows encoding (e.g. Windows-1252). As mentioned before, chances are that ISO-8859-1 is the right encoding. For Eastern European languages, ISO-8859-2 would be the right choice.

The second issue is the character set your database supports for character columns (this parameter is set during installation or when a new instance is created). But since you can insert those characters directly, that won't be a problem in this case.

Which JDBC driver do you use? The thin driver should not cause any problems in that regard, while the OCI driver could add another layer of problems if the client's NLS_LANG setting doesn't match the database's character encoding.
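If no hex editor is at hand, the same check can be sketched in Java. The following is only an illustration under assumptions: the class name Utf8Check, its two helpers, and the sample bytes are invented for this example. It prints bytes as hex and uses a strict decoder to test whether they form valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {

    // Formats bytes as space-separated upper-case hex, like a hex editor row.
    static String toHex(byte[] raw) {
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Returns true if the bytes decode as UTF-8 without any malformed input.
    // REPORT makes the decoder throw instead of silently substituting U+FFFD.
    static boolean isValidUtf8(byte[] raw) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed sample data: "ÅÔÉA" once as single-byte ISO-8859-1 bytes,
        // once re-encoded as UTF-8 for comparison.
        byte[] singleByte = {(byte) 0xC5, (byte) 0xD4, (byte) 0xC9, 0x41};
        byte[] utf8 = "\u00C5\u00D4\u00C9A".getBytes(StandardCharsets.UTF_8);

        System.out.println(toHex(singleByte) + " valid UTF-8: " + isValidUtf8(singleByte));
        System.out.println(toHex(utf8) + " valid UTF-8: " + isValidUtf8(utf8));
    }
}
```

For a real file, the bytes could come from java.nio.file.Files.readAllBytes. Note that a file containing only ASCII passes this check too, so a positive result does not prove the file was written as UTF-8; a negative result, however, does rule UTF-8 out.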

Erich Kitzmueller answered Oct 30 '22 07:10