Writing my code for Python 2.6, but with Python 3 in mind, I thought it was a good idea to put
from __future__ import unicode_literals
at the top of some modules. In other words, I am asking for trouble now (to avoid it in the future), but I might be missing some important knowledge here. I want to be able to pass a string representing a file path and instantiate an object as simply as
MyObject('H:\unittests')
In Python 2.6, this works just fine, no need to use double backslashes or a raw string, even for a directory starting with '\u..', which is exactly what I want. In the __init__ method I make sure all single \ occurrences are interpreted as '\\', including those before special characters as in \a, \b, \f, \n, \r, \t and \v (only \x remains a problem). Also, decoding the given string into unicode using the (local) encoding works as expected.
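For illustration, the kind of normalization described above might look roughly like the sketch below. The helper name restore_backslashes and its mapping table are made up for this example and are not taken from the actual __init__ method. It can only undo the single-character escapes listed; \x, \u and octal escapes are consumed (or rejected) by the parser before the function ever sees the string.

```python
# Hypothetical helper: map control characters that came from unintended
# escape sequences back to "backslash + letter".
_CONTROL_TO_ESCAPE = {
    '\a': '\\a',
    '\b': '\\b',
    '\f': '\\f',
    '\n': '\\n',
    '\r': '\\r',
    '\t': '\\t',
    '\v': '\\v',
}


def restore_backslashes(path):
    """Return path with the control characters above turned back into text."""
    return ''.join(_CONTROL_TO_ESCAPE.get(ch, ch) for ch in path)


# 'H:\temp\new' is parsed as H: + TAB + emp + NEWLINE + ew
print(restore_backslashes('H:\temp\new') == 'H:\\temp\\new')
```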
Preparing for Python 3.x, simulating my actual problem in an editor (starting with a clean console in Python 2.6), the following happens:
>>> '\u'
'\\u'
>>> r'\u'
'\\u'
(OK until here: '\u' is encoded by the console using the local encoding)   
>>> from __future__ import unicode_literals
>>> '\u'
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: end of string in escape sequence
In other words, the (unicode) string is not interpreted as unicode at all, nor does it get decoded automatically with the local encoding. The same happens for a raw string:
>>> r'\u'
SyntaxError: (unicode error) 'rawunicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX
The same goes for u'\u':
>>> u'\u'
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: end of string in escape sequence
Also, I would expect isinstance(str(''), unicode) to return True (which it does not), because importing unicode_literals should make all string types unicode. (edit:) Because in Python 3 all strings are sequences of Unicode characters, I would expect str('') to return such a unicode string, and type(str('')) to be both <type 'unicode'> and <type 'str'> (because all strings are unicode), but I also realise that <type 'unicode'> is not <type 'str'>. Confusion all around...
Questions
How can I pass strings containing '\u'? (without writing '\\u')
Does from __future__ import unicode_literals really implement all Python 3 related unicode changes so that I get a complete Python 3 string environment?
edit:
In Python 3, <type 'str'> is a Unicode object and <type 'unicode'> simply does not exist. In my case I want to write code for Python 2(.6) that will work in Python 3. But when I import unicode_literals, I cannot check if a string is of <type 'unicode'> because:
in Python 3, unicode is not part of the namespace
in Python 2, unicode is part of the namespace, but a literal of <type 'str'> is still unicode when it is created in the same module
type(mystring) will always return <type 'str'> for unicode literals in Python 3
My modules are usually encoded in 'utf-8' via a # coding: UTF-8 comment at the top, while my locale.getdefaultlocale()[1] returns 'cp1252'. So if I call MyObject('çça') from my console, it is encoded as 'cp1252' in Python 2, and as 'utf-8' when calling MyObject('çça') from the module. In Python 3, it will not be encoded at all; it is a unicode literal.
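One common way around the missing unicode name is to pick the text type once and test against that. The sketch below is only an illustration: text_type and to_text are hypothetical names, and falling back to locale.getpreferredencoding() is an assumption about which encoding console input uses (the text above reads it from locale.getdefaultlocale()[1] instead).

```python
import locale

try:
    text_type = unicode  # Python 2: the separate unicode type
except NameError:
    text_type = str      # Python 3: str is already unicode


def to_text(value, encoding=None):
    """Return value as text, decoding byte strings when necessary."""
    if isinstance(value, text_type):
        return value
    if encoding is None:
        # Assumption: byte strings arrive in the locale's preferred encoding.
        encoding = locale.getpreferredencoding()
    return value.decode(encoding)


# The same three characters, once as cp1252 bytes and once as UTF-8 bytes.
from_console = to_text(b'\xe7\xe7a', 'cp1252')
from_module = to_text(b'\xc3\xa7\xc3\xa7a', 'utf-8')
print(from_console == from_module)
```

With a helper like this, the isinstance check is written against text_type, so the same line works whether or not unicode exists in the namespace.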
edit:
I gave up hope of being allowed to avoid using '\' before a u (or x for that matter). I also understand the limitations of importing unicode_literals. However, the many possible combinations of passing a string from a module to the console and vice versa, each with a different encoding, and on top of that importing unicode_literals or not and Python 2 vs Python 3, made me want to create an overview by actual testing. Hence the table below.
In other words, type(str('')) does not return <type 'str'> in Python 3, but <class 'str'>, and all of the Python 2 problems seem to be avoided.
AFAIK, all that from __future__ import unicode_literals does is make all string literals in the module unicode type instead of str type. That is:
>>> type('')
<type 'str'>
>>> from __future__ import unicode_literals
>>> type('')
<type 'unicode'>
But str and unicode are still different types, and they behave just like before.
>>> type(str(''))
<type 'str'>
The result of str() is always of str type, with or without the import.
About your r'\u' issue, it is by design, as it is equivalent to ur'\u' without unicode_literals. From the docs:
When an 'r' or 'R' prefix is used in conjunction with a 'u' or 'U' prefix, then the \uXXXX and \UXXXXXXXX escape sequences are processed while all other backslashes are left in the string.
This probably follows from the way the lexical analyzer worked in the Python 2 series. In Python 3 it works as you (and I) would expect.
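If you want to check how a given interpreter treats these literals without crashing your own module, you can hand the source text to compile() and see whether it parses. This is just a sketch; compiles is a made-up helper name. On Python 3 I would expect only the plain '\u' to be rejected. On Python 2 without unicode_literals all three parse, because they are byte strings there.

```python
def compiles(source):
    """Return True if source parses as a Python expression."""
    try:
        compile(source, '<demo>', 'eval')
    except SyntaxError:
        return False
    return True


# Each literal is wrapped in a second string so this file itself always parses.
plain = "'\\u'"      # source text: '\u'
raw = "r'\\u'"       # source text: r'\u'
escaped = "'\\\\u'"  # source text: '\\u'

for src in (plain, raw, escaped):
    print(src, compiles(src))
```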
You can type the backslash twice, and then the \u will not be interpreted, but you'll get two backslashes!
Backslashes can be escaped with a preceding backslash; however, both remain in the string
>>> ur'\\u'
u'\\\\u'
So IMHO, you have two simple options:
Do not use raw strings, and escape your backslashes (compatible with python3):
'H:\\unittests'
Be too smart and take advantage of unicode codepoints (not compatible with python3):
r'H:\u005cunittests'
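For the first option, here is a small sketch of spellings that avoid the \u problem and should behave the same with or without unicode_literals, on Python 2 and 3. It uses ntpath (the Windows flavour of os.path) so that it runs on any platform; the variable names are only for illustration.

```python
import ntpath

# Escaped backslash: the literal contains a single backslash character.
escaped = 'H:\\unittests'

# Let the path module add the separator, so no \u ever appears in a literal.
joined = ntpath.join('H:\\', 'unittests')

# Forward slashes are accepted too; normpath turns them into backslashes.
normalized = ntpath.normpath('H:/unittests')

print(escaped == joined == normalized)
```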