
Umlaut character not accepted via keyboard (codepage 65001, UTF-8) to be read by perl script

Please let me state first that this problem is strictly related to the perl diamond operator accepting input that has been directly typed on the keyboard.

Had I been asking about the perl diamond operator accepting input that has been piped or read from a file, then yes, this would be a duplicate of question 519309 -- How do I read Utf-8 with diamond operator.

However, this is not about piped or file data, but rather about input that has been directly typed on the keyboard. Therefore, I argue, this question is not a duplicate of 519309.

Here are the details of my question:

I am trying to use umlaut characters ('ä', 'ö', 'ü', ...) on my keyboard.

I have a very simple perl script (the one-liner used in the commands below) that accepts a line from the keyboard and then immediately prints it back to the screen.

If I use umlaut characters with codepage 1252, then everything works as expected:

C:\>chcp 1252 & perl -CS -we"print '*** '; $txt = <>; print '--- ', $txt;"
Page de codes active : 1252
*** ü
--- ü

However, if I use the same umlaut characters with codepage 65001 (UTF-8), then I get a "Use of uninitialized value" warning and the umlaut is not accepted:

C:\>chcp 65001 & perl -CS -we"print '*** '; $txt = <>; print '--- ', $txt;"
Page de codes active : 65001
*** ü
Use of uninitialized value $txt in print at -e line 1.
---

If I pipe the umlaut into my perl program, then I have no problem:

C:\>chcp 65001 & echo ü | perl -CS -we"print '*** '; $txt = <>; print '--- ', $txt;"
Page de codes active : 65001
*** --- ü

Why do I get this warning with codepage 65001 (UTF-8)?

I am using Windows 7 x64, with Strawberry Perl 5.22.

Just for the record, if I use pure batch commands (that is, without perl), then I can successfully key in umlaut characters with codepage 65001 (UTF-8):

C:\>chcp 65001 & set /p txt=*** & echo --- %txt%
Page de codes active : 65001
*** ü
--- ü

The question really is: Why is perl not able to accept umlaut characters by keyboard with codepage 65001, whereas the very same keyboard input, same codepage 65001, works ok as a pure dos batch command?

There seems to be something fundamentally different between piping umlaut characters and typing umlaut characters directly from the keyboard.

Why is typing an umlaut character on the keyboard not working, whereas the same thing works perfectly fine as a piped character?

user2288349 asked Aug 27 '15 18:08


2 Answers

Try changing the console font to "Lucida Console".

You can also try running chcp 65001 in the console. This command sets the console code page to UTF-8.

If characters are displayed incorrectly, install the required font on the system.

More details here

Actually, the problem does not belong to perl. It belongs to the Windows terminal. Try how it works in this console. You can log the binary data that was read from input to a file and compare the two cases (terminal vs. Cygwin).
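The comparison suggested above could be done with a small hex dump of whatever a read returns. This is only a minimal sketch: the helper names `hex_dump` and `dump_line` are hypothetical, and the demonstration reads from an in-memory handle holding the UTF-8 bytes of "ü" rather than from a real console.

```perl
use strict;
use warnings;

# Return the bytes of a string as space-separated hex pairs,
# or a marker when the read returned nothing (undef, i.e. EOF).
sub hex_dump {
    my ($bytes) = @_;
    return '<EOF/undef>' unless defined $bytes;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

# Read one line from the given handle as raw bytes and return its hex dump.
sub dump_line {
    my ($fh) = @_;
    binmode $fh;            # no decoding layer, so we see the bytes as delivered
    my $line = <$fh>;
    return hex_dump($line);
}

# Demonstration on an in-memory handle (assumed sample: UTF-8 "ü" plus newline).
my $sample = "\xc3\xbc\n";
open my $in, '<', \$sample or die "Cannot open in-memory handle: $!";
print dump_line($in), "\n";
```

To compare the two cases, pass `\*STDIN` instead of `$in`, redirect the output to a file, and run the script once in the Windows console and once in the other terminal. A result of `<EOF/undef>` would mean the read returned nothing at all.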

Eugen Konkov answered Nov 14 '22 04:11


This is a Microsoft bug. On code page 65001, the Windows APIs ReadFile() and ReadConsoleA() return 0 bytes read (which indicates EOF) when the typed input contains non-ASCII characters. That is why the diamond operator returns undef and perl warns about an uninitialized value. See this blog for details.
As Microsoft will not fix this, the only available answer is to tell the Perl maintainers to switch to using ReadConsoleW() and to convert the resulting wide characters to UTF-8 with WideCharToMultiByte(CP_UTF8, ...).
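The conversion step of that workaround can be illustrated in perl with the core Encode module. This is a hedged sketch, not what perl does internally: it does not call ReadConsoleW() at all, the helper name `wide_to_utf8` is hypothetical, and the input buffer is a hard-coded assumption of what a wide-character read of "ü" followed by Enter might contain.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert a buffer of UTF-16LE code units (the wide-character format)
# into UTF-8 bytes, which is the job WideCharToMultiByte(CP_UTF8, ...) performs.
sub wide_to_utf8 {
    my ($wide) = @_;
    my $chars = decode('UTF-16LE', $wide);   # raw bytes -> perl characters
    return encode('UTF-8', $chars);          # perl characters -> UTF-8 bytes
}

# Assumed sample buffer: "ü" (U+00FC) followed by CR LF, as wide characters.
my $wide = "\xfc\x00\x0d\x00\x0a\x00";
my $utf8 = wide_to_utf8($wide);

# Show the resulting UTF-8 bytes in hex.
print join(' ', map { sprintf '%02x', ord } split //, $utf8), "\n";
```

The point of the design is that the wide-character read path avoids the code page conversion where the 0-byte result occurs, and the UTF-8 encoding is done afterwards in user code.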

MarkI answered Nov 14 '22 05:11