I am reading data from a text file with the following properties:
Encoding: ANSI
File Type: PC
The file contains a lot of special characters, such as the degree symbol (º). I am reading it using the following code:
File file = new File("C:\\X\\Y\\SpecialCharacter.txt");
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"));
If the file encoding is ANSI, the above code does not read the special characters properly. For example, given this line in the file:
"Lower heat and simmer until product reaches internal temperature of 165ºF"
reader.readLine() would output:
"Lower heat and simmer until product reaches internal temperature of 165�F"
When I changed the encoding for the file to UTF-8, the line reads as it is in the file without messing up the special characters.
My question, at what point does the data get messed up? When storing the data in the file or when reading it from the file? Opening the file in Notepad displays all the special characters properly. How does that happen?
Hexdump output:
-0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -A -B -C -D -E -F
00000000- 4C 6F 77 65 72 20 68 65 61 74 20 61 6E 64 20 73 [Lower heat and s]
00000001- 69 6D 6D 65 72 20 75 6E 74 69 6C 20 70 72 6F 64 [immer until prod]
00000002- 75 63 74 20 72 65 61 63 68 65 73 20 69 6E 74 65 [uct reaches inte]
00000003- 72 6E 61 6C 20 74 65 6D 70 65 72 61 74 75 72 65 [rnal temperature]
00000004- 20 6F 66 20 31 36 35 BA 46 [ of 165.F ]
"ANSI" is not a particular encoding - it's a whole collection of encodings. You need to use the right encoding when reading the file. For example, it's entirely possible that you're using the Windows-1252 encoding, which means you may want to try passing in "Cp1252" as the encoding name.
In fact, you're passing in "UTF-8" which isn't one of the encodings typically referred to as ANSI. You need to find out the exact encoding that the file uses, and then specify that in the InputStreamReader parameter.
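As a minimal sketch of that change, assuming the file really is Windows-1252 (a guess based on the 0xBA byte in the hex dump, not something the question confirms), only the charset passed to InputStreamReader needs to differ from the original code. The class name ReadAnsiFile, the helper readFirstLine and the constant ASSUMED_CHARSET are made up for illustration:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class ReadAnsiFile {
    // Assumption: the file was saved as Windows-1252. Swap in the real
    // encoding once you have confirmed what it is.
    static final Charset ASSUMED_CHARSET = Charset.forName("windows-1252");

    // Reads the first line of the file, decoding its bytes with the given charset.
    static String readFirstLine(File file, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charset))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        File file = new File("C:\\X\\Y\\SpecialCharacter.txt");
        System.out.println(readFirstLine(file, ASSUMED_CHARSET));
    }
}
```

Passing a Charset object rather than a name string avoids the checked UnsupportedEncodingException, and try-with-resources closes the reader even if readLine() throws.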
My question, at what point does the data get messed up? When storing the data in the file or when reading it from the file?
Assuming the encoding is capable of representing all the characters you're interested in, it's only when you read the file. Basically, you're trying to read it as if it's in one encoding, when it's actually in another. Notepad is either performing some sort of heuristic encoding detection, or it happens to use the right default for this particular situation.
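To see that the bytes on disk are fine and only the decoding step differs, here is a small sketch that decodes the last nine bytes from the hex dump in the question twice. It assumes Windows-1252 is the file's real encoding; DecodeDemo, TAIL and decode are illustrative names, not part of the original code. In UTF-8 a lone 0xBA is a continuation byte with no lead byte, so the decoder substitutes U+FFFD, while Windows-1252 maps 0xBA directly to U+00BA (º):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // The bytes of " of 165ºF" exactly as they appear in the hex dump.
    static final byte[] TAIL = {0x20, 0x6F, 0x66, 0x20, 0x31, 0x36, 0x35, (byte) 0xBA, 0x46};

    // Decodes the same bytes with whichever charset the caller chooses.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String asUtf8 = decode(TAIL, StandardCharsets.UTF_8);
        // Assumption: Windows-1252 is the encoding the file was written in.
        String asCp1252 = decode(TAIL, Charset.forName("windows-1252"));

        // Print code points instead of the characters themselves so the
        // result does not depend on the console's own encoding.
        System.out.println(String.format("UTF-8:        U+%04X", (int) asUtf8.charAt(7)));   // U+FFFD
        System.out.println(String.format("windows-1252: U+%04X", (int) asCp1252.charAt(7))); // U+00BA
    }
}
```

The input bytes are identical in both calls, which is why the corruption can only be happening at read time.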