
Determining ISO-8859-1 vs US-ASCII charset

I am trying to determine whether to use

PrintWriter pw = new PrintWriter(outputFilename, "ISO-8859-1");

or

PrintWriter pw = new PrintWriter(outputFilename, "US-ASCII");

I was reading "All about character sets" to determine the character set of an example file, because I must create a file in the same encoding from Java code.

When my example file contains "European" letters (Norwegian: å ø æ), the following command tells me the file encoding is "iso-8859-1":

file -bi example.txt

However, when I take a copy of the same example file and modify it to contain different data, without any Norwegian text (let's say, I replace "Bjørn" with "Bjorn"), then the same command tells me the file encoding is "us-ascii".

file -bi example-no-european-letters.txt

What does this mean? Is ISO-8859-1 in practice the same as US-ASCII if there are no "European" characters in it?

Should I just use the charset "ISO-8859-1" and everything will be OK?

vikingsteve asked Dec 24 '22 18:12


2 Answers

If the file contains only 7-bit US-ASCII characters, it can be read as US-ASCII. That tells you nothing about which charset was intended. It may be just a coincidence that the file has no characters that would require a different encoding.
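
To make that concrete, here is a minimal sketch (the class and method names are made up for illustration) showing that ASCII-only text produces exactly the same bytes in US-ASCII and ISO-8859-1, which is presumably why a tool that only looks at the bytes reports the narrower charset:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiSubsetDemo {
    // True when every byte is in the 7-bit range 0x00..0x7F.
    // Java bytes are signed, so 0x80..0xFF appear as negative values.
    static boolean isPureAscii(byte[] data) {
        for (byte b : data) {
            if (b < 0) {
                return false;
            }
        }
        return true;
    }

    // True when the text encodes to identical bytes in US-ASCII and ISO-8859-1.
    static boolean sameBytesInBothCharsets(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(ascii, latin1);
    }

    public static void main(String[] args) {
        String plain = "Bjorn";
        // "Bjørn", written with a Unicode escape so the encoding of this
        // source file does not affect the result.
        String norwegian = "Bj\u00f8rn";

        System.out.println(sameBytesInBothCharsets(plain));
        System.out.println(sameBytesInBothCharsets(norwegian));
        System.out.println(isPureAscii(norwegian.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

For "Bjorn" the two byte arrays match; for "Bjørn" they differ, because US-ASCII has no ø and `getBytes` substitutes a replacement byte.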

ISO-8859-1 (and -15) is a common European encoding, able to encode äöåéü and other characters. Its first 128 characters (code points 0 to 127) are the same as in US-ASCII, as is the case for many encodings, for convenience.

However, you can't just pick an encoding and assume that "everything will be OK". The very common UTF-8 encoding is also a superset of US-ASCII, but it encodes characters such as äöå as two bytes each instead of ISO-8859-1's one byte.
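
A small sketch of that size difference, using only `String.getBytes` from the standard library (the class and helper names are just for this example):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedLengthDemo {
    // Number of bytes the text occupies when encoded with the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String ascii = "abc";
        String nordic = "\u00e4\u00f6\u00e5"; // "äöå" as Unicode escapes

        // ASCII-only text has the same size in both encodings.
        System.out.println(encodedLength(ascii, StandardCharsets.ISO_8859_1));
        System.out.println(encodedLength(ascii, StandardCharsets.UTF_8));

        // One byte per character in ISO-8859-1, two bytes per character in UTF-8.
        System.out.println(encodedLength(nordic, StandardCharsets.ISO_8859_1));
        System.out.println(encodedLength(nordic, StandardCharsets.UTF_8));
    }
}
```

So a file written as UTF-8 and read back as ISO-8859-1 (or the other way round) only looks correct as long as it stays within the ASCII range.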

TL;DR: Don't assume things with encodings. Find out what was intended and use that. If you can't find it out, observe the data to try to figure out what is a correct charset to use (as you noted yourself, multiple encodings may work at least temporarily).
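
If you do have to observe the data, one possible approach is to try a strict decode with each candidate charset and see which ones reject the bytes. The sketch below assumes the file content is already in a byte array; `CharsetProbe` and `decodesCleanly` are hypothetical names. Keep in mind that ISO-8859-1 assigns a character to every byte value, so a clean ISO-8859-1 decode rules nothing out and proves nothing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class CharsetProbe {
    // True when the bytes decode under the charset without any malformed
    // or unmappable sequence. REPORT makes the decoder throw instead of
    // silently substituting a replacement character.
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "Bjørn" as it would be stored in an ISO-8859-1 file: ø is the single byte 0xF8.
        byte[] data = {0x42, 0x6A, (byte) 0xF8, 0x72, 0x6E};

        System.out.println(decodesCleanly(data, StandardCharsets.US_ASCII));   // byte 0xF8 is outside 7-bit ASCII
        System.out.println(decodesCleanly(data, StandardCharsets.UTF_8));      // 0xF8 is not a valid UTF-8 byte
        System.out.println(decodesCleanly(data, StandardCharsets.ISO_8859_1)); // every byte maps to a character
    }
}
```

A probe like this can only exclude candidates. It cannot confirm what the author of the file intended.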

Kayaman answered Dec 31 '22 11:12


It depends on which characters the document uses. ASCII is a 7-bit charset, and ISO-8859-1 is an 8-bit charset that supports some additional characters. In most cases, if you are going to reproduce the document from an input stream, I recommend the ISO-8859-1 charset. It works for text files opened in programs such as Notepad and MS Word.

If you use other international characters, check for a charset that supports those particular characters, such as UTF-8.
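
One way to do that check in Java before writing is `CharsetEncoder.canEncode`. This is only a sketch with made-up class and method names. It matters because `PrintWriter` does not fail on characters the charset cannot represent; it writes a replacement (typically `?`) instead.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CanEncodeDemo {
    // True when every character of the text can be represented in the charset.
    // A fresh encoder is created per call because encoders are stateful.
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String norwegian = "Bj\u00f8rn"; // "Bjørn"
        String euro = "\u20ac";          // the euro sign, which is not part of ISO-8859-1

        System.out.println(canEncode(norwegian, StandardCharsets.US_ASCII));
        System.out.println(canEncode(norwegian, StandardCharsets.ISO_8859_1));
        System.out.println(canEncode(euro, StandardCharsets.ISO_8859_1));
        System.out.println(canEncode(euro, StandardCharsets.UTF_8));
    }
}
```

If the check fails for the charset you were told to use, that is a sign the data and the intended encoding do not match, and it is better to find out why than to switch charsets silently.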

Kaliappan answered Dec 31 '22 11:12