
Returning the first N characters of a unicode string

I have a string in unicode and I need to return the first N characters. I am doing this:

result = unistring[:5]

but of course the length of the string in bytes != its length in characters. Any ideas? Is using re the only solution?

Edit: More info

unistring = "Μεταλλικα" #Metallica written in Greek letters
result = unistring[:1]

returns-> ?

I think each character in a unicode string takes two bytes, which is why this happens. If I do:

result = unistring[:2]

I get

M

which is correct. So, should I always slice with twice the count, or should I convert to something?

Jon Romero asked Jan 28 '10 10:01





2 Answers

Unfortunately, for historical reasons, prior to Python 3.0 there are two string types: byte strings (str) and Unicode strings (unicode).

Prior to the unification in Python 3.0, there are two ways to declare a string literal: unistring = "Μεταλλικα", which is a byte string, and unistring = u"Μεταλλικα", which is a unicode string.

The reason you see ? when you do result = unistring[:1] is that slicing a byte string counts bytes, not characters. Assuming your source is UTF-8, each of these Greek letters takes two bytes, so [:1] returns only the first half of Μ, which is not a valid character on its own and is displayed as ?. You have probably seen this kind of problem if you ever used a really old email client and received emails from friends in countries like Greece, for example.
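To see the byte-versus-character mismatch concretely, here is a minimal sketch. It assumes Python 3, where bytes stands in for the Python 2 str type from the question, and it assumes the text is encoded as UTF-8. The variable names are only illustrative.

```python
# Python 3 sketch: bytes plays the role of the Python 2 byte string (str).
text = "Μεταλλικα"
raw = text.encode("utf-8")       # assumption: the source file was UTF-8

print(len(text))                 # number of characters
print(len(raw))                  # number of bytes

first_byte = raw[:1]             # only half of the two-byte sequence for Μ
first_two = raw[:2]              # the complete two-byte sequence for Μ

# An incomplete sequence cannot be decoded; errors="replace" substitutes
# the Unicode replacement character, which terminals often render as ?.
shown_one = first_byte.decode("utf-8", errors="replace")
shown_two = first_two.decode("utf-8")
print(repr(shown_one), shown_two)
```

Slicing with twice the count only happens to work here because every letter in this particular string takes two bytes in UTF-8; ASCII letters take one byte and many other characters take three or four.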

So in Python 2.x, if you need to handle Unicode, you have to do it explicitly. Take a look at this introduction to dealing with Unicode in Python: Unicode HOWTO

Tendayi Mawushe answered Oct 19 '22 20:10



When you say:

unistring = "Μεταλλικα" #Metallica written in Greek letters

You do not have a unicode string. You have a bytestring in (presumably) UTF-8. That is not the same thing. A unicode string is a separate datatype in Python. You get unicode by decoding bytestrings using the right encoding:

unistring = "Μεταλλικα".decode('utf-8')

or by using the unicode literal in a source file with the right encoding declaration:

# coding: UTF-8
unistring = u"Μεταλλικα"

The unicode string will do what you want when you do unistring[:5].
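As a rough illustration of decoding first and slicing afterwards, the sketch below wraps the idea in a small function. It assumes Python 3 (where bytes corresponds to the Python 2 byte string) and UTF-8 input by default; first_chars is a hypothetical helper name, not something from the standard library.

```python
def first_chars(data, n, encoding="utf-8"):
    """Return the first n characters of data as text.

    Hypothetical helper: data may be bytes or already-decoded text.
    """
    if isinstance(data, bytes):
        # Decode once, so that slicing counts characters rather than bytes.
        data = data.decode(encoding)
    return data[:n]


raw = "Μεταλλικα".encode("utf-8")   # stands in for the byte string in the question
result = first_chars(raw, 5)
print(result)

# Encode again only at the boundary, if bytes are needed for output.
encoded = result.encode("utf-8")
print(len(result), len(encoded))
```

Note that slicing a decoded string counts code points, so a letter followed by a separate combining accent would count as two.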

Thomas Wouters answered Oct 19 '22 21:10
