I have a Person POJO with a name attribute, which I store in the persons table of my database. My database server is MySQL with UTF-8 set as the default server encoding; the persons table is an InnoDB table that was also created with UTF-8 as its default encoding, and my connection string specifies UTF-8 as the connection encoding.
I need to create and store new Person POJOs by reading their names from a text file (persons.txt) that contains one name per line, but the file is encoded in UTF-16.
persons.txt
John
Μαρία
Hélène
etc..
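Here is some sample code:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

PersonDao dao = new PersonDao();
File file = new File("persons.txt");
// Decode the file's bytes as UTF-16; try-with-resources closes the reader when done.
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_16))) {
    String line;
    while ((line = reader.readLine()) != null) {
        Person p = new Person();
        p.setName(line.trim());
        dao.save(p);
    }
}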
To sum up, I read the strings as UTF-16, keep them in local variables, and persist them as UTF-8.
I would like to ask: does any character conversion take place during this procedure? If so, at what point does it happen? Could I end up storing broken characters because of the UTF-16 -> UTF-8 workflow?
The Difference
UTF-8 and UTF-16 both encode the same set of Unicode characters. Both are variable-length encodings that need up to 32 bits (4 bytes) per character. The difference is that UTF-8 encodes the most common characters, including English letters and digits, in 8 bits, whereas UTF-16 uses at least 16 bits for every character.
UTF-16 is more compact where ASCII is not predominant, since it uses 2 bytes for most characters. UTF-8 needs 3 bytes for many of those characters (most CJK text, for example), and both encodings need 4 bytes for characters outside the Basic Multilingual Plane.
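To make the size difference concrete, here is a small sketch that counts the encoded bytes for the names from the question plus two extra characters. The class and method names are made up for illustration, and UTF-16BE is used so that no byte order mark is counted.

```java
import java.nio.charset.StandardCharsets;

class EncodedSizes {
    // Number of bytes needed to store s in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed to store s in UTF-16 (big-endian, no byte order mark).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of the source file's encoding.
        String[] samples = {
            "John",
            "\u039C\u03B1\u03C1\u03AF\u03B1", // "Maria" in Greek letters
            "H\u00E9l\u00E8ne",               // "Helene" with accented e's
            "\u4E2D",                         // a CJK character
            "\uD83D\uDE00"                    // an emoji outside the Basic Multilingual Plane
        };
        for (String s : samples) {
            System.out.println(s + ": UTF-8=" + utf8Length(s)
                    + " bytes, UTF-16=" + utf16Length(s) + " bytes");
        }
    }
}
```

The byte counts come out as 4 vs 8 for John, 10 vs 10 for the Greek name, 8 vs 12 for the French name, 3 vs 2 for the CJK character, and 4 vs 4 for the emoji.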
InputStreamReader decodes bytes from their external representation in the specified encoding (UTF-16 in your case) into Java's internal representation (i.e. char, String), which is always UTF-16 too. So in your case this step only reassembles the bytes into chars, honoring the byte order; effectively no character conversion takes place here.
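A minimal sketch of that decoding step, assuming an in-memory byte array as a stand-in for persons.txt (the class and method names are hypothetical):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class Utf16Decode {
    // Reads the first line from UTF-16 encoded bytes, the same way the reader in the question does.
    static String firstLine(byte[] utf16Bytes) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(utf16Bytes), StandardCharsets.UTF_16))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = "\u039C\u03B1\u03C1\u03AF\u03B1"; // "Maria" in Greek letters
        // Stand-in for the file contents; Java's UTF-16 encoder writes a byte order mark first.
        byte[] fileBytes = (name + "\n").getBytes(StandardCharsets.UTF_16);
        String decoded = firstLine(fileBytes);
        System.out.println(decoded.equals(name)); // true
        System.out.println(decoded.length());     // 5: the byte order mark is consumed, not kept as a char
    }
}
```

The UTF-16 decoder uses the byte order mark to pick the byte order and then drops it, so the resulting String holds exactly the five characters of the name.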
The internal representation of Strings is converted to the database encoding by your JDBC driver, so you don't have to do it yourself (though in the case of MySQL you do need to specify the proper encoding in the connection string).
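For illustration, a connection string along these lines is what is meant. This is only a sketch: it assumes MySQL Connector/J and its characterEncoding connection property, and the host, port, and database name are placeholders.

```java
class MySqlUrl {
    // Builds a Connector/J URL that asks the driver to use UTF-8 on the connection.
    // The helper itself is hypothetical; only the characterEncoding property matters here.
    static String utf8Url(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database
                + "?characterEncoding=UTF-8";
    }

    public static void main(String[] args) {
        // Placeholder values; pass the result to java.sql.DriverManager.getConnection(url, user, password).
        System.out.println(utf8Url("localhost", 3306, "mydb"));
    }
}
```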
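If the input encoding and (in the case of MySQL) the database encoding are specified correctly, no data is lost during these conversions, since UTF-8 and UTF-16 both represent the same character set. One MySQL-specific caveat: the charset named utf8 stores at most 3 bytes per character, so characters outside the Basic Multilingual Plane (such as emoji) require utf8mb4.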
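The lossless round trip can be checked without a database. The sketch below (hypothetical class and method names) decodes UTF-16 bytes, re-encodes them as UTF-8, and decodes again; for contrast it does the same with ISO-8859-1, which cannot represent Greek letters.

```java
import java.nio.charset.StandardCharsets;

class Utf16ToUtf8RoundTrip {
    // Decode UTF-16 bytes to a String, encode it as UTF-8, then decode the UTF-8 bytes again.
    static String viaUtf8(byte[] utf16Bytes) {
        String decoded = new String(utf16Bytes, StandardCharsets.UTF_16);
        byte[] utf8Bytes = decoded.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    // Same flow with ISO-8859-1 as the target; unmappable characters are replaced with '?'.
    static String viaLatin1(byte[] utf16Bytes) {
        String decoded = new String(utf16Bytes, StandardCharsets.UTF_16);
        byte[] latin1Bytes = decoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String name = "\u039C\u03B1\u03C1\u03AF\u03B1"; // "Maria" in Greek letters
        byte[] utf16Bytes = name.getBytes(StandardCharsets.UTF_16);
        System.out.println(viaUtf8(utf16Bytes).equals(name));   // true
        System.out.println(viaLatin1(utf16Bytes).equals(name)); // false
        System.out.println(viaLatin1(utf16Bytes));              // ?????
    }
}
```

Broken characters show up only when one side of the pipeline uses an encoding that cannot represent the text, or when bytes are decoded with the wrong charset; UTF-16 to UTF-8 is neither of those cases.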