So I have a string:
amélie
In bytes it is b'ame\xcc\x81lie'
The bytes \xcc\x81 are the UTF-8 encoding of U+0301, the combining acute accent, which attaches to the previous character: http://www.fileformat.info/info/unicode/char/0301/index.htm
u'ame\u0301lie'
When I do: 'amélie'.title() on that string, I get 'AméLie', which makes no sense to me.
I know I can use a workaround, but is this intended behavior or a bug? I would expect the "l" NOT to get capitalized.
Another experiment:
In [1]: [ord(c) for c in 'amélie'.title()]
Out[1]: [65, 109, 101, 769, 76, 105, 101]
In [2]: [ord(c) for c in 'amélie']
Out[2]: [97, 109, 101, 769, 108, 105, 101]
To decode UTF-8 data in Python 3, use the decode() method of bytes objects. It accepts two arguments, encoding and errors: encoding names the encoding of the bytes to be decoded, and errors decides how errors that arise during decoding are handled.
UTF-8 is a byte-oriented encoding: each character is represented by a specific sequence of one to four bytes.
The encode() method of str does the reverse and encodes the string using the specified encoding. If no encoding is specified, UTF-8 is used.
In Python 3, UTF-8 is the default for both encode() and decode().
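As a rough illustration of the round trip described above, here is a minimal Python 3 sketch using the bytes from the question. The variable names and the deliberately broken byte string are made up for the example.

```python
# Bytes from the question: "e" followed by the UTF-8 bytes of U+0301.
raw = b'ame\xcc\x81lie'

# bytes -> str; 'utf-8' is also the default when no encoding is given.
text = raw.decode('utf-8')
print(len(raw), len(text))   # 8 bytes, 7 code points

# str -> bytes; encode() defaults to UTF-8 as well.
round_trip = text.encode()

# Hypothetical invalid input: \xcc starts a two-byte sequence, but no
# continuation byte follows it.
broken = b'am\xcclie'

# errors='replace' substitutes U+FFFD instead of raising.
lenient = broken.decode('utf-8', errors='replace')

# The default errors='strict' raises UnicodeDecodeError.
try:
    broken.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```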
Take a look at these questions: Python title() with apostrophes and Titlecasing a string with exceptions
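Basically, it looks like a limitation of the built-in title() function, which seems to be very liberal about what it considers a word boundary. The combining accent U+0301 is not a letter, so title() treats the "l" after it as the start of a new word.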
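To see why the accent acts as a boundary, it may help to inspect the string one code point at a time. The sketch below uses the standard unicodedata module. That title() restarts a word after any non-letter is my reading of the documented "groups of consecutive letters" rule, not something I have checked against the implementation.

```python
import unicodedata

# Decomposed form from the question: "e" + U+0301 COMBINING ACUTE ACCENT.
s = 'ame\u0301lie'

# General category per code point: 'Ll' is a lowercase letter,
# 'Mn' is a nonspacing mark.
categories = [unicodedata.category(c) for c in s]
print(categories)   # ['Ll', 'Ll', 'Ll', 'Mn', 'Ll', 'Ll', 'Ll']

# The combining mark does not count as a letter...
accent_is_letter = '\u0301'.isalpha()
print(accent_is_letter)   # False

# ...so the "l" that follows it is capitalized like the start of a word.
titled = s.title()
print(ascii(titled))   # 'Ame\u0301Lie'
```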
You can use string.capwords:
import string
string.capwords('amélie')
Out[18]: 'Amélie'
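A slightly longer sketch of how string.capwords differs from title(): it splits on whitespace only, capitalizes each piece, and rejoins with single spaces, so neither the combining accent nor an apostrophe starts a new word. The sample strings are made up for illustration.

```python
import string

# Decomposed accent, plus a run of two spaces between the words.
name = 'ame\u0301lie  poulain'

capped = string.capwords(name)
print(ascii(capped))   # 'Ame\u0301lie Poulain' (whitespace collapsed)

# The same difference shows up with apostrophes.
phrase = "they're here"
with_title = phrase.title()              # "They'Re Here"
with_capwords = string.capwords(phrase)  # "They're Here"
print(with_title, '|', with_capwords)
```

Note that capwords also collapses runs of whitespace, which may or may not be what you want.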
Another thing you could do is use the precomposed character é (U+00E9, which is b'\xc3\xa9' in UTF-8), an e with the accent built in:
b'am\xc3\xa9lie'.decode().title()
Out[21]: 'Amélie'
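If the input arrives in decomposed form, one way to get the precomposed character without retyping the string is Unicode NFC normalization. This is a sketch of that idea, not something from the original answer. The helper name title_nfc is hypothetical, and it only helps when a precomposed form of the character exists.

```python
import unicodedata

def title_nfc(text):
    """Compose combining sequences (NFC) first, then apply str.title()."""
    return unicodedata.normalize('NFC', text).title()

decomposed = 'ame\u0301lie'        # "e" + U+0301, as in the question
result = title_nfc(decomposed)
print(ascii(result))               # 'Am\xe9lie'

# The result uses the single precomposed code point U+00E9.
print(len(decomposed), len(result))   # 7 6
```

title() still splits on apostrophes and other non-letters after normalization, so string.capwords may remain the better fit for names and contractions.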