 

How do I specify extended ASCII (i.e. range(256)) in the Python magic encoding specifier line?

I'm using Mako templates to generate specialized config files. Some of these files contain extended-ASCII chars (>127), but Mako chokes, saying the chars are out of range, when I use:

## -*- coding: ascii -*-

So I'm wondering if perhaps there's something like:

## -*- coding: eascii -*-

that I can use, which would accept chars in range(128, 256).

EDIT:

Here's the dump of the offending section of the file:

000001b0  39 c0 c1 c2 c3 c4 c5 c6  c7 c8 c9 ca cb cc cd ce  |9...............|
000001c0  cf d0 d1 d2 d3 d4 d5 d6  d7 d8 d9 da db dc dd de  |................|
000001d0  df e0 e1 e2 e3 e4 e5 e6  e7 e8 e9 ea eb ec ed ee  |................|
000001e0  ef f0 f1 f2 f3 f4 f5 f6  f7 f8 f9 fa fb fc fd fe  |................|
000001f0  ff 5d 2b 28 27 73 29 3f  22 0a 20 20 20 20 20 20  |.]+('s)?".      |
00000200  20 20 74 6f 6b 65 6e 3a  20 57 4f 52 44 20 20 20  |  token: WORD   |
00000210  20 20 22 5b 41 2d 5a 61  2d 7a 30 2d 39 c0 c1 c2  |  "[A-Za-z0-9...|
00000220  c3 c4 c5 c6 c7 c8 c9 ca  cb cc cd ce cf d0 d1 d2  |................|
00000230  d3 d4 d5 d6 d7 d8 d9 da  db dc dd de df e0 e1 e2  |................|
00000240  e3 e4 e5 e6 e7 e8 e9 ea  eb ec ed ee ef f0 f1 f2  |................|
00000250  f3 f4 f5 f6 f7 f8 f9 fa  fb fc fd fe ff 5d 2b 28  |.............]+(|

The first byte Mako complains about is at offset 000001b4. If I remove this section, everything works fine. With the section in place, Mako complains:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)

It's the same complaint whether I use 'ascii' or 'latin-1' in the magic comment line.
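For what it's worth, here is a minimal Python 3 sketch (hypothetical, not the actual Mako setup) showing that the byte run from the dump fails an ASCII decode but decodes fine as latin-1, so the 'ascii' error must come from a component that ignores the declared encoding:

```python
# The 0xc0-0xff run seen in the hex dump above.
raw = bytes(range(0xC0, 0x100))

try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    print(exc)  # 'ascii' codec can't decode byte 0xc0 in position 0 ...

text = raw.decode("latin-1")
print(len(text))  # one character per byte
```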

Thanks!

Greg

gred asked Jul 27 '11 20:07




3 Answers

Short answer

Use cp437 as the encoding for some retro DOS fun: every byte value from 32 decimal upward, except 127, maps to a displayable character in that encoding. Then use cp037 as the encoding for a truly trippy time. And then ask yourself how you really know which of these, if either of them, is "correct".
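To see the point concretely (a Python 3 sketch using the two codecs named above): the same bytes decode to different strings under different single-byte encodings, and the bytes themselves cannot tell you which reading is "correct".

```python
# The five bytes 74 6f 6b 65 6e from the dump.
raw = b"\x74\x6f\x6b\x65\x6e"

print(raw.decode("cp437"))  # 'token' -- cp437 agrees with ASCII here
print(raw.decode("cp037"))  # a different string under the EBCDIC mapping
```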

Long answer

There is something you must unlearn: the absolute equivalence of byte values and characters.

Many basic text editors and debugging tools today, and also the Python language specification, imply an absolute equivalence between bytes and characters when in reality none exists. It is not true that 74 6f 6b 65 6e is "token". Only for ASCII-compatible character encodings is this correspondence valid. In EBCDIC, which is still quite common today, "token" corresponds to byte values a3 96 92 85 95.

So while the Python 2.6 interpreter happily evaluates 'text' == u'text' as True, it shouldn't, because they are only equivalent under the assumption of ASCII or a compatible encoding, and even then they should not be considered equal. (At least '\xfd' == u'\xfd' is False and gets you a warning for trying.) Python 3.1 evaluates 'text' == b'text' as False. But even the acceptance of this expression by the interpreter implies an absolute equivalence of byte values and characters, because the expression b'text' is taken to mean "the byte-string you get when you apply the ASCII encoding to 'text'" by the interpreter.
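Python 3 makes the separation described above explicit (a minimal sketch):

```python
# str and bytes never compare equal in Python 3; equality holds only
# after an explicit decode with a named encoding.
assert ("text" == b"text") is False
assert b"text".decode("ascii") == "text"
```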

As far as I know, every programming language in widespread use today carries an implicit use of ASCII or ISO-8859-1 (Latin-1) character encoding somewhere in its design. In C, the char data type is really a byte. I saw one Java 1.4 VM where the constructor java.lang.String(byte[] data) assumed ISO-8859-1 encoding. Most compilers and interpreters assume ASCII or ISO-8859-1 encoding of source code (some let you change it). In Java, string length is really the UTF-16 code unit length, which is arguably wrong for characters U+10000 and above. In Unix, filenames are byte-strings interpreted according to terminal settings, allowing you to open('a\x08b', 'w').write('Say my name!').

So we have all been trained and conditioned by the tools we have learned to trust, to believe that 'A' is 0x41. But it isn't. 'A' is a character and 0x41 is a byte and they are simply not equal.
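A two-line sketch of that last claim: the character 'A' and the byte 0x41 coincide only under ASCII-compatible encodings.

```python
assert "A".encode("ascii") == b"\x41"
assert "A".encode("cp037") == b"\xc1"  # EBCDIC puts 'A' at 0xC1
```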

Once you have become enlightened on this point, you will have no trouble resolving your issue. You have simply to decide what component in the software is assuming the ASCII encoding for these byte values, and how to either change that behavior or ensure that different byte values appear instead.

PS: The phrases "extended ASCII" and "ANSI character set" are misnomers.

wberry answered Oct 07 '22 00:10


Try

## -*- coding: UTF-8 -*-

or

## -*- coding: latin-1 -*-

or

## -*- coding: cp1252 -*-

depending on what you really need. The last two are similar except:

The Windows-1252 codepage coincides with ISO-8859-1 for all codes except the range 128 to 159 (hex 80 to 9F), where the little-used C1 controls are replaced with additional characters. Windows-28591 is the actual ISO-8859-1 codepage.

where ISO-8859-1 is the official name for latin-1.
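A sketch of the difference: bytes 0x80-0x9F are C1 control characters in latin-1 but printable characters (e.g. curly quotes) in cp1252.

```python
raw = b"\x93quoted\x94"

assert raw.decode("latin-1") == "\x93quoted\x94"     # C1 controls survive as-is
assert raw.decode("cp1252") == "\u201cquoted\u201d"  # curly double quotes
```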

agf answered Oct 07 '22 01:10


Try examining your data with a critical eye:

000001b0  39 c0 c1 c2 c3 c4 c5 c6  c7 c8 c9 ca cb cc cd ce  |9...............|
000001c0  cf d0 d1 d2 d3 d4 d5 d6  d7 d8 d9 da db dc dd de  |................|
000001d0  df e0 e1 e2 e3 e4 e5 e6  e7 e8 e9 ea eb ec ed ee  |................|
000001e0  ef f0 f1 f2 f3 f4 f5 f6  f7 f8 f9 fa fb fc fd fe  |................|
000001f0  ff 5d 2b 28 27 73 29 3f  22 0a 20 20 20 20 20 20  |.]+('s)?".      |
00000200  20 20 74 6f 6b 65 6e 3a  20 57 4f 52 44 20 20 20  |  token: WORD   |
00000210  20 20 22 5b 41 2d 5a 61  2d 7a 30 2d 39 c0 c1 c2  |  "[A-Za-z0-9...|
00000220  c3 c4 c5 c6 c7 c8 c9 ca  cb cc cd ce cf d0 d1 d2  |................|
00000230  d3 d4 d5 d6 d7 d8 d9 da  db dc dd de df e0 e1 e2  |................|
00000240  e3 e4 e5 e6 e7 e8 e9 ea  eb ec ed ee ef f0 f1 f2  |................|
00000250  f3 f4 f5 f6 f7 f8 f9 fa  fb fc fd fe ff 5d 2b 28  |.............]+(|

The highlighted runs are two copies of every byte from 0xc0 to 0xff inclusive. You appear to have a binary file (perhaps a dump of compiled regexes), not a text file. I suggest reading it as a binary file rather than pasting it into your Python source file. You should also read the Mako docs to find out what input it expects.

Update after eyeballing the text part of your dump: You may well be able to express this in ASCII-only regexes, e.g. you would have a line containing:

token: WORD "[A-Za-z0-9\xc0-\xff]+(etc)etc"
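The same byte range can be matched from Python with an ASCII-only bytes pattern (a sketch, not part of the original answer; the `('s)?` suffix is taken from the dump above):

```python
import re

# Matching 0xC0-0xFF via escapes in a bytes pattern, so no source-file
# encoding declaration is needed at all.
word = re.compile(rb"[A-Za-z0-9\xc0-\xff]+('s)?")

assert word.match(b"token")
assert word.match(b"caf\xe9")  # 0xe9 falls inside the \xc0-\xff range
```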
John Machin answered Oct 07 '22 00:10