I am working on some code that has to manipulate unicode strings. I am trying to write doctests for it, but am having trouble. The following is a minimal example that illustrates the problem:
# -*- coding: utf-8 -*-
def mylen(word):
"""
>>> mylen(u"áéíóú")
5
"""
return len(word)
print mylen(u"áéíóú")
First we run the code to see the expected output of print mylen(u"áéíóú")
.
$ python mylen.py
5
Next, we run doctest on it to see the problem.
$ python -m
5
**********************************************************************
File "mylen.py", line 4, in mylen.mylen
Failed example:
mylen(u"áéíóú")
Expected:
5
Got:
10
**********************************************************************
1 items had failures:
1 of 1 in mylen.mylen
***Test Failed*** 1 failures.
How then can I test that mylen(u"áéíóú")
evaluates to 5?
To allow working with Unicode characters, Python 2 has a unicode type which is a collection of Unicode code points (like Python 3's str type). The line ustring = u'A unicode \u018e string \xf1' creates a Unicode string with 20 characters.
Python's string type uses the Unicode Standard for representing characters, which lets Python programs work with all these different possible characters. Unicode (https://www.unicode.org/) is a specification that aims to list every character used by human languages and give each character its own unique code.
Python 2 uses str type to store bytes and unicode type to store unicode code points. All strings by default are str type — which is bytes~ And Default encoding is ASCII.
If you want unicode strings, you have to use unicode docstrings! Mind the u
!
# -*- coding: utf-8 -*-
def mylen(word):
u""" <----- SEE 'u' HERE
>>> mylen(u"áéíóú")
5
"""
return len(word)
print mylen(u"áéíóú")
This will work -- as long as the tests pass. For Python 2.x you need yet another hack to make verbose doctest mode work or get correct tracebacks when tests fail:
if __name__ == "__main__":
import sys
reload(sys)
sys.setdefaultencoding("UTF-8")
import doctest
doctest.testmod()
NB! Only ever use setdefaultencoding for debug purposes. I'd accept it for doctest use, but not anywhere in your production code.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With