
Special characters appearing as question marks

Using the Python programming language, I'm having trouble outputting characters such as å, ä and ö. The following code gives me a question mark (?) as output, not an å:

#coding: iso-8859-1
input = "å"
print input

The following code lets you enter arbitrary text. The for loop goes through each character of the input, appends it to the string variable a, and the resulting string is then printed. This code works correctly: you can input å, ä and ö and the output will still be correct. For example, "år" outputs "år" as expected.

#coding: iso-8859-1
input = raw_input("Test: ")
a = ""
for i in range(0, len(input)):
    a = a + input[i]
print a

What's interesting is that if I change input = raw_input("Test: ") to input = "år", it will output a question mark (?) for the "å".

#coding: iso-8859-1
input = "år"
a = ""
for i in range(0, len(input)):
    a = a + input[i]
print a

For what it's worth, I'm using TextWrangler, and my document's character encoding is set to ISO Latin 1. What causes this? How can I solve the problem?

Måns Nilsson asked Oct 21 '22 20:10



1 Answer

You're using Python 2, and I assume you're running it on a platform (such as Linux or macOS) whose terminal uses UTF-8 for I/O.

Python 2's "" literals represent byte-strings. So when you specify "år" in your ISO 8859-1-encoded source file, the variable input has the value b'\xe5r'. When you print this, the raw bytes are written to the console unchanged, but the å shows up as a question mark because the byte 0xE5 followed by "r" is not a valid UTF-8 sequence.

To see this, replace print a with print repr(a), which shows the raw bytes ('\xe5r') instead of sending them to the terminal.
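
As a rough illustration of the explanation above, the sketch below checks those same two bytes. It is written so that it runs on Python 3 as well as Python 2 (hence the b'' literals and single-argument print calls). The names raw and is_valid_utf8 are made up for this example and do not come from the question.

```python
# Bytes that an ISO 8859-1 source file stores for the literal "år".
raw = b'\xe5r'

def is_valid_utf8(data):
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

print(repr(raw))                    # the raw bytes, not the rendered character
print(is_valid_utf8(raw))           # False: 0xE5 opens a multi-byte sequence that "r" cannot continue
print(is_valid_utf8(b'\xc3\xa5r'))  # True: these bytes are "år" encoded as UTF-8
```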

When you use raw_input(), the user's input arrives already encoded as UTF-8, so it is output correctly.

To fix this, either:

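  • Convert your string to UTF-8 before printing it. Because a is a byte-string in ISO 8859-1, decode it first; calling a.encode('utf-8') directly would make Python 2 try an implicit ASCII decode and raise UnicodeDecodeError:

    print a.decode('iso-8859-1').encode('utf-8')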
    
  • Use Unicode strings (u'text') instead of byte-strings. You will need to be careful with decoding the input, since on Python 2, raw_input() returns a byte-string rather than a text string. If you know the input is UTF-8, use raw_input().decode('utf-8').

  • Save your source file as UTF-8 instead of ISO 8859-1, and change the declaration at the top to #coding: utf-8 to match. Then the byte-string literal will already be in UTF-8.
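
The decode-then-encode route can be sketched as follows. This is a minimal example under stated assumptions: the bytes are written out explicitly so the result does not depend on how the file is saved, the code is meant to run on both Python 2 and Python 3, and the variable names are illustrative only.

```python
# "år" as the bytes an ISO 8859-1 source file would contain (assumed input).
latin1_bytes = b'\xe5r'

# Decode with the encoding the bytes are actually in to get a text string.
text = latin1_bytes.decode('iso-8859-1')

# Re-encode as UTF-8, which is what a UTF-8 terminal expects to receive.
utf8_bytes = text.encode('utf-8')

print(repr(utf8_bytes))  # 0xC3 0xA5 for "å", followed by "r"
```

On Python 2, printing the decoded text string itself to a terminal should also work, since Python then encodes it using the terminal's encoding.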

Mechanical snail answered Oct 23 '22 10:10
