
Read UTF-16 chars from a file and store them as UTF-8

Tags: java, file, utf-8

I have a Person POJO with a name attribute, which I store in my database in the corresponding persons table. My database server is MySQL with UTF-8 set as the default server encoding, the persons table is an InnoDB table that was also created with UTF-8 as its default encoding, and my connection string specifies UTF-8 as the connection encoding.

I need to create and store new Person POJOs by reading their names from a text file (persons.txt) that contains one name per line, but the file is encoded in UTF-16.

persons.txt

John

Μαρία

Hélène

etc.

Here is some sample code:

import java.io.*;
import java.nio.charset.StandardCharsets;

PersonDao dao = new PersonDao();
File file = new File("persons.txt");
// Decode the file's bytes as UTF-16; try-with-resources closes the reader when done.
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_16))) {
    String line;
    while ((line = reader.readLine()) != null) {
        Person p = new Person();
        p.setName(line.trim());
        dao.save(p);
    }
}

To sum up, I read the characters as UTF-16, store them in local variables, and persist them as UTF-8.

I would like to ask: Does any character conversion take place during this procedure? If so, at what point does it happen? Could I end up storing broken characters because of the UTF-16 -> UTF-8 workflow?

Argyro Kazaki asked Feb 24 '11 12:02





1 Answer

InputStreamReader decodes the bytes of the file from the encoding you specify (UTF-16 in your case) into Java's internal representation (char and String), which is always UTF-16 as well. So in your case the reader only decodes the byte stream (handling the byte order mark and byte order); no conversion between different encodings takes place at this point.
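To illustrate the decoding step, here is a minimal sketch that reads UTF-16 bytes through an InputStreamReader, as the question's code does. The class name Utf16DecodeSketch and the helper readLines are made up for this example, and an in-memory byte array stands in for persons.txt. The non-ASCII names are written as Unicode escapes so the source compiles regardless of the compiler's source encoding.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class Utf16DecodeSketch {

    // Hypothetical helper: decodes UTF-16 bytes into Java Strings, one per line.
    static List<String> readLines(byte[] utf16Bytes) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(utf16Bytes), StandardCharsets.UTF_16))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in for persons.txt: "John", "Μαρία", "Hélène" encoded as UTF-16.
        String content = "John\n\u039c\u03b1\u03c1\u03af\u03b1\nH\u00e9l\u00e8ne\n";
        byte[] fileBytes = content.getBytes(StandardCharsets.UTF_16);

        for (String name : readLines(fileBytes)) {
            // Each name is now an ordinary Java String (UTF-16 code units in memory).
            System.out.println(name + " has " + name.length() + " chars");
        }
    }
}
```

The byte order mark that the UTF-16 encoder writes at the start of the data is consumed by the decoder, so it does not show up in the first name.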

The JDBC driver is responsible for converting Strings from this internal representation to the database encoding, so you don't need to handle that yourself (though with MySQL you do need to make sure the proper encoding is specified in the connection string).
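As a rough sketch of what specifying the encoding in the connection string can look like with MySQL Connector/J: the useUnicode and characterEncoding URL properties are the ones commonly used for this, but check the documentation for your driver version. The class name, the buildUrl helper, and the host, port, and database values are placeholders for illustration only.

```java
public class MysqlUrlSketch {

    // Hypothetical helper: builds a Connector/J URL that asks the driver
    // to exchange character data with the server as UTF-8.
    static String buildUrl(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database
                + "?useUnicode=true&characterEncoding=UTF-8";
    }

    public static void main(String[] args) {
        // Placeholder connection details, not taken from the question.
        String url = buildUrl("localhost", 3306, "mydb");
        System.out.println(url);
        // The URL would then be passed to DriverManager.getConnection(url, user, password).
    }
}
```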

If the input encoding and (in the case of MySQL) the database encoding are specified correctly, no data is lost in these conversions, since UTF-8 and UTF-16 both encode the same character set (Unicode).
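The lossless claim can be checked with a small round trip in plain Java, without a database: decode UTF-16 bytes into a String, encode that String as UTF-8, and decode it again. This is only a sketch of the encoding step; it does not exercise the JDBC driver or MySQL. The class name RoundTripSketch and the helper utf16ToUtf8 are invented for the example.

```java
import java.nio.charset.StandardCharsets;

public class RoundTripSketch {

    // Hypothetical helper: decodes UTF-16 bytes and re-encodes the text as UTF-8.
    static byte[] utf16ToUtf8(byte[] utf16Bytes) {
        String text = new String(utf16Bytes, StandardCharsets.UTF_16);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Μαρία" and "Hélène", written as Unicode escapes.
        String[] names = {"John", "\u039c\u03b1\u03c1\u03af\u03b1", "H\u00e9l\u00e8ne"};

        for (String name : names) {
            byte[] utf16 = name.getBytes(StandardCharsets.UTF_16);
            byte[] utf8 = utf16ToUtf8(utf16);
            String restored = new String(utf8, StandardCharsets.UTF_8);

            // The byte counts differ between the encodings, but the text is identical.
            System.out.println(name + ": " + utf16.length + " UTF-16 bytes, "
                    + utf8.length + " UTF-8 bytes, unchanged = " + restored.equals(name));
        }
    }
}
```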

axtavt answered Nov 09 '22 22:11
