I have a Unicode string and I need to return its first N characters. I am doing this:
result = unistring[:5]
but of course the length of a Unicode string in bytes is not the same as its length in characters. Any ideas? Is using re the only solution?
Edit: More info
unistring = "Μεταλλικα" #Metallica written in Greek letters
result = unistring[:1]
returns-> ?
I think each character in a unicode string takes two bytes, which is why this happens. If I do:
result = unistring[:2]
I get
M
which is correct. So, should I always slice with twice the count, or should I convert to something?
The Unicode character \xe9 (U+00E9) is an accented e: é
Use string slicing to get the first N characters of a string, e.g. first_n = string[:n]. The slicing operation returns a new string that starts at index 0 and contains the first N characters of the original string.
To extract the first two characters of a string in Python you can use [:2], which is the short version of [0:2].
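As a minimal sketch, assuming Python 3 (where every str is already a Unicode string), slicing counts characters rather than bytes. The variable names are only illustrative:

```python
# Python 3: str holds Unicode code points, so slices count characters.
text = "Μεταλλικα"

first_five = text[:5]      # same as text[0:5]
print(first_five)

# A slice that runs past the end does not raise; it returns what is there.
whole = text[:50]
print(whole == text)
```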
Unfortunately, for historical reasons, prior to Python 3.0 there are two string types: byte strings (str) and Unicode strings (unicode).
Prior to the unification in Python 3.0 there are two ways to declare a string literal: unistring = "Μεταλλικα", which is a byte string, and unistring = u"Μεταλλικα", which is a unicode string.
The reason you see ? when you do result = unistring[:1] is that the byte string stores each Greek character as more than one byte (two in UTF-8), so a one-byte slice holds only part of a character and cannot be displayed correctly. You have probably seen this kind of problem if you ever used a really old email client and received emails from friends in countries like Greece, for example.
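To make the byte-versus-character mismatch concrete, here is a small sketch, assuming the text is encoded as UTF-8 (the question does not say which encoding was in use). It is written for Python 3, where the byte string type is spelled bytes; the escapes spell the Greek word from the question.

```python
# The Greek word from the question, written with escapes so the code
# points are unambiguous: Μ ε τ α λ λ ι κ α
text = u"\u039c\u03b5\u03c4\u03b1\u03bb\u03bb\u03b9\u03ba\u03b1"
raw = text.encode("utf-8")   # the byte-string form (assumed UTF-8)

print(len(text))   # 9 characters
print(len(raw))    # 18 bytes: each Greek letter takes two bytes in UTF-8

# One byte is only half of the first letter, so it cannot be decoded.
half = raw[:1].decode("utf-8", errors="replace")
print(half == u"\ufffd")     # True: U+FFFD, the replacement character

# Two bytes make up the complete first letter.
first = raw[:2].decode("utf-8")
print(first == text[0])      # True
```

This also shows why doubling the slice length is fragile: it only works while every character happens to take exactly two bytes in the chosen encoding.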
So in Python 2.x if you need to handle Unicode you have to do it explicitly. Take a look at this introduction to dealing with Unicode in Python: Unicode HOWTO
When you say:
unistring = "Μεταλλικα" #Metallica written in Greek letters
You do not have a unicode string. You have a bytestring in (presumably) UTF-8. That is not the same thing. A unicode string is a separate datatype in Python. You get unicode by decoding bytestrings using the right encoding:
unistring = "Μεταλλικα".decode('utf-8')
or by using the unicode literal in a source file with the right encoding declaration:
# coding: UTF-8
unistring = u"Μεταλλικα"
The unicode string will do what you want when you do unistring[:5].
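The decode-then-slice approach could be wrapped in a small helper, sketched here for Python 3. The function name first_n_chars and its parameters are hypothetical, not part of the answer, and the UTF-8 default is an assumption about the input.

```python
def first_n_chars(data, n, encoding="utf-8"):
    """Return the first n characters of data.

    data may be bytes (decoded with the given encoding first)
    or an already-decoded text string.
    """
    if isinstance(data, bytes):
        data = data.decode(encoding)
    return data[:n]


raw = "Μεταλλικα".encode("utf-8")   # bytes, as read from a file or socket
print(first_n_chars(raw, 5))          # first five letters, not five bytes
print(first_n_chars("Μεταλλικα", 1))  # text input is sliced directly
```

Decoding once at the boundary and working with text everywhere else avoids having to reason about byte widths when slicing.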