
Java: convert String from windows-1251 to UTF-8

Scanner sc = new Scanner(System.in);
System.out.println("Enter text: ");
String text = sc.nextLine();
try {
    String result = new String(text.getBytes("windows-1251"), Charset.forName("UTF-8"));
    System.out.println(result);
} catch (UnsupportedEncodingException e) {
    System.out.println(e);
}

I'm trying to change the keyboard layout: input from a Cyrillic keyboard, output in Latin. Example: qwerty => йцукен

It doesn't work. Can anyone tell me what I'm doing wrong?

asked Nov 18 '14 13:11 by halem


2 Answers

First, Java text (String, char, Reader, Writer) is internally Unicode, so it can combine all scripts. This is a major difference from, for instance, C/C++, where there is no such standard.

System.in is an InputStream, a stream of bytes, for historical reasons. Reading text from it therefore requires an indication of the encoding those bytes are in.

Scanner sc = new Scanner(System.in, "Windows-1251");

The above explicitly sets the encoding used to decode System.in to Windows-1251 (Cyrillic). Without this optional parameter, the default encoding is used. Unless the program has changed it, that is the platform encoding, so the original call might have been correct too.

Now text is correct, containing the Cyrillic from System.in as Unicode.
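To see the effect of the charset argument without typing on a Cyrillic console, here is a minimal sketch that feeds Windows-1251 bytes to a Scanner from memory. The class and method names are made up for illustration, the sample text is "йцукен" written as Unicode escapes so the source file's own encoding doesn't matter, and it assumes your JDK ships the windows-1251 charset (standard JDKs do).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.util.Scanner;

class ScannerCharsetDemo {
    // The six Cyrillic letters from the question's example, as Unicode escapes.
    static final String SAMPLE = "\u0439\u0446\u0443\u043a\u0435\u043d";

    // Encodes SAMPLE as Windows-1251 bytes, then reads one line back
    // through a Scanner that decodes with the given charset.
    static String decodeSample(String charsetName) {
        byte[] cp1251 = (SAMPLE + "\n").getBytes(Charset.forName("windows-1251"));
        Scanner sc = new Scanner(new ByteArrayInputStream(cp1251), charsetName);
        return sc.nextLine();
    }

    public static void main(String[] args) {
        // Matching charset: the text survives intact.
        System.out.println(decodeSample("windows-1251").equals(SAMPLE)); // true
        // Mismatched charset: the same bytes decode to different characters.
        System.out.println(decodeSample("ISO-8859-1").equals(SAMPLE));   // false
    }
}
```

The same reasoning applies to System.in: the bytes are whatever the console sends, and the charset given to the Scanner only decides how they are interpreted.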

You would get the UTF-8 bytes as:

byte[] bytes = text.getBytes(StandardCharsets.UTF_8);

The old "recoding" of text was wrong; drop that line. It encoded the string to Windows-1251 bytes and then decoded those bytes as if they were UTF-8. In fact, not all Windows-1251 bytes are valid UTF-8 multi-byte sequences.

String result = text;

System.out.println(result);
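A small sketch of why the old line garbles the text, assuming the windows-1251 charset is available in your JDK. The class and method names are invented for the example; wrongRecode does what the question's line does, and roundTrip shows that encoding and decoding with the same charset is harmless.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RecodingDemo {
    // The six Cyrillic letters from the question's example, as Unicode escapes.
    static final String SAMPLE = "\u0439\u0446\u0443\u043a\u0435\u043d";

    // Same steps as the question's line: encode to Windows-1251,
    // then decode those bytes as if they were UTF-8.
    static String wrongRecode(String text) {
        byte[] cp1251 = text.getBytes(Charset.forName("windows-1251"));
        return new String(cp1251, StandardCharsets.UTF_8);
    }

    // Encode and decode with the same charset: the text is unchanged.
    static String roundTrip(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = wrongRecode(SAMPLE);
        System.out.println(garbled.equals(SAMPLE));          // false
        // Malformed UTF-8 input is replaced by U+FFFD, the replacement character.
        System.out.println(garbled.indexOf('\uFFFD') >= 0);  // true
        System.out.println(roundTrip(SAMPLE).equals(SAMPLE)); // true
    }
}
```

Once the replacement characters are in the String, the original letters are gone; there is nothing left to repair afterwards.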

System.out is a PrintStream, a rather rarely used historic class. It prints using the default platform encoding, so you more or less have to rely on that default encoding being correct.

System.out.println(result);
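If the default turns out to be wrong, one option is to wrap the stream in a PrintStream with an explicit encoding. This is only a sketch: utf8Stream and encodedLength are hypothetical helper names, and whether the console then displays the text properly still depends on the console's own encoding, which Java does not control.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8PrintDemo {
    // Wraps a byte stream so that printed text is encoded as UTF-8,
    // regardless of the platform default.
    static PrintStream utf8Stream(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Prints text into an in-memory buffer and reports how many bytes came out.
    static int encodedLength(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream ps = utf8Stream(buffer);
        ps.print(text);
        ps.flush();
        return buffer.size();
    }

    public static void main(String[] args) {
        // Six Cyrillic letters, as Unicode escapes.
        String result = "\u0439\u0446\u0443\u043a\u0435\u043d";
        utf8Stream(System.out).println(result);
        // Each of these letters takes two bytes in UTF-8.
        System.out.println(encodedLength(result)); // 12
    }
}
```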

To write to a UTF-8 encoded file:

byte[] bytes = ("\uFEFF" + text).getBytes(StandardCharsets.UTF_8);
Path path = Paths.get("C:/Temp/test.txt");
Files.writeAllBytes(path, bytes);

Here I have added a Unicode BOM character in front, so that Windows Notepad may recognize the encoding as UTF-8. In general one should avoid using a BOM. It is a zero-width no-break space (so invisible) and plays havoc with all kinds of formats and operations: CSV, XML, file concatenation, cut-copy-paste.
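A short sketch of what the BOM looks like on the byte level and how one might drop it again after reading. The names bomHex and stripBom are made up for the example. Note that Java's UTF-8 decoder does not remove the BOM by itself, so it stays in the String as an invisible first character.

```java
import java.nio.charset.StandardCharsets;

class BomDemo {
    // U+FEFF, the character used as the byte order mark.
    static final String BOM = "\uFEFF";

    // The UTF-8 encoding of the BOM, as upper-case hex.
    static String bomHex() {
        StringBuilder sb = new StringBuilder();
        for (byte b : BOM.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Removes a leading BOM if there is one; otherwise returns the text unchanged.
    static String stripBom(String text) {
        return text.startsWith(BOM) ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        byte[] bytes = (BOM + "abc").getBytes(StandardCharsets.UTF_8);
        String decoded = new String(bytes, StandardCharsets.UTF_8);

        System.out.println(bomHex());          // EFBBBF
        System.out.println(decoded.length());  // 4: the BOM is still there
        System.out.println(stripBom(decoded)); // abc
    }
}
```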

answered Sep 28 '22 10:09 by Joop Eggen


You got an answer to a different question, and nobody answered yours, because your title doesn't fit the question. You were not attempting to convert between charsets, but rather between keyboard layouts.

Here you shouldn't worry about character encoding at all. Simply read the line, convert it to an array of characters, and go through them, replacing each one using a predefined map.

The code will be something like this:

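// Generic type arguments must be reference types, so use Character, not char.
Map<Character, Character> table = new TreeMap<Character, Character>();
table.put('q', 'й');
table.put('Q', 'Й');
table.put('w', 'ц');
// .... etc

String text = sc.nextLine(); // sc is the Scanner from the question
char[] cArr = text.toCharArray();
for(int i=0; i<cArr.length; ++i)
{
  // Replace characters that have a mapping; leave all others unchanged.
  if(table.containsKey(cArr[i]))
  {
    cArr[i] = table.get(cArr[i]);
  }
}
text = new String(cArr);
System.out.println(text);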

I haven't had time to test that code, but it should give you the idea of how to do your task.
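To avoid typing one put call per key, the table can also be filled from two parallel strings. The following is a sketch under stated assumptions: the class and method names are invented, only the first six keys of the top row (qwerty to йцукен) are listed, and the Cyrillic letters are written as Unicode escapes so the source file's encoding doesn't matter. Extend both strings in step for the rest of the layout.

```java
import java.util.HashMap;
import java.util.Map;

class LayoutDemo {
    // Keys in the same physical positions on the two layouts.
    // Only the first six keys of the top letter row; extend both in parallel.
    static final String LATIN = "qwerty";
    static final String CYRILLIC = "\u0439\u0446\u0443\u043a\u0435\u043d";

    // Builds the lookup table, adding the upper-case pair for each key as well.
    static Map<Character, Character> buildTable() {
        Map<Character, Character> table = new HashMap<>();
        for (int i = 0; i < LATIN.length(); i++) {
            char from = LATIN.charAt(i);
            char to = CYRILLIC.charAt(i);
            table.put(from, to);
            table.put(Character.toUpperCase(from), Character.toUpperCase(to));
        }
        return table;
    }

    // Replaces every character that has a mapping; others pass through unchanged.
    static String convert(String text) {
        Map<Character, Character> table = buildTable();
        StringBuilder sb = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            Character mapped = table.get(c);
            sb.append(mapped != null ? mapped : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Compare against the expected text instead of printing it,
        // so the result does not depend on the console encoding.
        System.out.println(convert("qwerty").equals(CYRILLIC)); // true
    }
}
```

Deriving the upper-case pairs with Character.toUpperCase only works for letter keys. Keys such as [ and ], whose shifted forms are different symbols, would need explicit entries.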

answered Sep 28 '22 11:09 by v010dya