I have a JavaEE project, in which I use message properties files. The encoding of those file is set to UTF-8. In the file I use the german umlauts like ä
, ö
, ü
. The problem is, sometimes those characters are replaced with unicode like \uFFFD\uFFFD
, but not for every character. Now, I have a case where ä
and ü
are both replaced with \uFFFD\uFFFD
, but not for every occurring of ä
and ü
.
The Git diff shows me something like this:
mail.adresses=E-Mail hinzufügen: -mail.adresses.multiple=E-Mails durch Kommata getrennt hinzufügen. +mail.adresses.multiple=E-Mails durch Kommata getrennt hinzuf\uFFFD\uFFFDgen. mail.title=Einladungs-E-Mail box.preview=Vorschau box.share.text=Sie können jetzt die ausgewählten Bilder mit Ihren Freunden teilen. @@ -6880,7 +6880,7 @@ browser.cancel=Abbrechen browser.selectImage=übernehmen browser.starImage=merken browser.removeImage=Löschen -browser.searchForSimilarImages=ähnliche +browser.searchForSimilarImages=\uFFFD\uFFFDhnliche browser.clear_drop_box=löschen
Also, there are lines changed, which I have not touched. I don't understand why I get such a behavior. What could be the cause for the above problem?
My system:
Antergos / Arch Linux
System encoding UTF-8
Python 3.5.0 (default, Sep 20 2015, 11:28:25) [GCC 5.2.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import sys >>> sys.getdefaultencoding() 'utf-8'
Eclipse Mars 1
If I use another Editor like Atom to edit those message properties files, I don't ran into this problem.
I also realized in a case, if I copy the original value browser.searchForSimilarImages=ähnliche
from Git diff and replace the wrong value browser.searchForSimilarImages=\uFFFD\uFFFDhnliche
in Eclipse with that, then I have the correct umlauts in the message properties file.
properties files are Latin1 (ISO-8859-1) encoded by definition. ISO-8859-1 as its default encoding. You can change this under: Preferences > General > Content Types.
In Eclipse, go to Preferences>General>Workspace and select UTF-8 as the Text File Encoding. This should set the encoding for all the resources in your workspace. Any components you create from now on using the default encoding should all match.
By default ISO 8859-1 character encoding is used for Eclipse properties file (read here), so if the file contains any character beyond ISO 8859-1 then it will not be processed as expected.
If you use Eclipse then you will notice that it implicitly converts the special character into \uXXXX equivalent. Try copying
会意字 / 會意字
into a properties file opened in Eclipse.
EDIT: As per comment from OP
Update the encoding of your Eclipse as shown below. If you set encoding as UTF-32 then even you can see Chinese character, which you cannot see generally.
How to change Encoding of properties file in Eclipse: See this Eclipse Bugzilla bug for more details, which talks about several other possibilities and in the end suggest what I have highlighted below.
Chinese characters can be seen in Eclipse after encoding is set properly:
If above doesn't work consistently for you (it does work for me and I never see encoding issues) then try this using some Eclipse plugin which handles encoding of properties or other files. For example Eclipse ResourceBundle Editor or Extended Resource-Bundle editor
I would recommend using Eclipse ResourceBundle Editor.
Another possibility to change encoding of file is using Edit --> Set Encoding
option. It really matters because it changes the default character set and file encoding. Play around with by changing encoding using Edit --> Set Encoding
option and do following Java sysout System.out.println("Default Charset=" + Charset.defaultCharset());
and System.out.println(System.getProperty("file.encoding"));
As an aside: 1
Process the properties file to have content with ISO 8859-1 character encoding by using native2ascii - Native-to-ASCII Converter
What native2ascii does: It converts all the non-ISO 8859-1 character in their equivalent \uXXXX. This is a good tool because you need not to search the \uXXXX equivalent of special character.
Usage for UTF-8: native2ascii -encoding utf8 e:\a.txt e:\b.txt
As an aside: 2
Every computer program whether an IDE, application server, web server, browser, etc. understands only bits, so it need to know how to interpret the bits to make expected sense out of it because depending upon encoding used, same bits can represent different characters. And that's where "Encoding" comes into picture by giving a unique identifier to represent a character so that all computer programs, diverse OS etc. knows exact right way to interpret it.
So, if you have written into a file using some encoding scheme, lets say UTF-8, and then reading using any editor but running with encoding scheme as UTF-8 then you can expect to get correct display.
Please do read my this answer to get more details but from browser-server perspective.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With