
Python: Correct Way to refer to index of unicode string

I'm not sure if this is exactly the problem, but I'm trying to insert a tag around the first letter of a unicode string and it doesn't seem to work. Could this be because unicode indices work differently from those of regular strings?

Right now my code is this:

for index, paragraph in enumerate(intro[2:-2]):
    intro[index] = bold_letters(paragraph, 1)

def bold_letters(string, index):
    return "<b>"+string[0]+"</b>"+string[index:]

And I'm getting output like this:

<b>?</b>?רך האחד וישתבח הבורא בחכמתו ורצונו כל צבא השמים ארץ וימים אלה ואלונים. 

It seems the unicode gets messed up when I try to insert the HTML tag. I tried messing with the insert position but didn't make any progress.

Example desired output (hebrew goes right to left):

>>>first_letter_bold("הקדמה")
"הקדמ<\b>ה<b>"

BTW, this is for Python 2

Ester Lin asked Aug 30 '16 14:08


1 Answer

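You are right: in Python 2.x a plain str is a sequence of raw bytes, so indices count bytes, not characters. A Hebrew letter takes two bytes in UTF-8, so string[0] returns only the first half of the letter, which is why the output shows two "?" marks.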
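To make the byte-versus-character difference concrete, here is a small sketch. It builds the word from the question with explicit escapes and encodes it to UTF-8, so it should behave the same on Python 2 and Python 3; the variable names are only illustrative.

```python
# "הקדמה" written with escapes, as a Unicode string.
text = u"\u05d4\u05e7\u05d3\u05de\u05d4"
# The UTF-8 bytes, which is what a plain Python 2 str would hold.
raw = text.encode("utf-8")

# Each Hebrew letter takes two bytes in UTF-8, so the lengths differ.
byte_count = len(raw)    # 10
char_count = len(text)   # 5

# Slicing the bytes keeps only the first half of the first letter ...
first_byte = raw[:1]
# ... while slicing the Unicode string keeps the whole letter.
first_char = text[:1]
```

The lone byte in first_byte is not valid UTF-8 on its own, which is why a terminal or browser shows it as "?".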

To work reliably with Unicode data in Python 2.x, first decode the bytes into a unicode object, then do the string manipulation. Finally, encode the result back to raw bytes so the interface stays the same, i.e. the function receives a str and returns a str.

Ideally you should decode all incoming data from UTF-8 bytes to unicode objects at the very beginning of your code (I am assuming your source encoding is UTF-8, because that is the standard most applications use these days) and encode back to raw bytes only at the very end, e.g. when saving to a DB or responding to a client. Some frameworks handle that for you, so you don't have to worry about it.

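def bold_letters(string, index):
    # Decode the UTF-8 bytes so that indexing works on characters, not bytes.
    string = string.decode('utf8')
    string = u"<b>" + string[0] + u"</b>" + string[index:]
    # Encode back to UTF-8 bytes for the caller.
    return string.encode('utf8')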

This also works for ASCII input, because UTF-8 is a superset of ASCII. For a better understanding of how Unicode works in general, and in Python specifically, read http://nedbatchelder.com/text/unipain.html
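The advice to decode once at the start and encode once at the end could look roughly like the sketch below. The names bold_first_letter and process_paragraphs are hypothetical, and UTF-8 input is an assumption, as in the answer. The code is written so it should run on both Python 2 and Python 3.

```python
def bold_first_letter(text):
    # Hypothetical helper: works purely on Unicode text, no encoding logic.
    return u"<b>" + text[:1] + u"</b>" + text[1:]


def process_paragraphs(raw_paragraphs, encoding="utf-8"):
    # Decode once at the input boundary (assumes UTF-8 unless told otherwise).
    paragraphs = [raw.decode(encoding) for raw in raw_paragraphs]
    # All manipulation happens on Unicode objects.
    marked_up = [bold_first_letter(p) for p in paragraphs]
    # Encode once at the output boundary.
    return [p.encode(encoding) for p in marked_up]
```

Keeping the helper free of decode and encode calls means it cannot accidentally be handed half-decoded data, and the encoding choice lives in one place.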

In Python 3.x, str is already a Unicode string, so indexing works on characters without any explicit conversion.
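As a rough illustration for Python 3, the function from the question can stay as it was written. The word is spelled with escapes here only to avoid right-to-left rendering issues in the listing.

```python
def bold_letters(string, index):
    # In Python 3, str holds Unicode code points, so string[0] is a whole letter.
    return "<b>" + string[0] + "</b>" + string[index:]


# "הקדמה" written with escapes.
word = "\u05d4\u05e7\u05d3\u05de\u05d4"
result = bold_letters(word, 1)
```

Note that indices still count code points rather than visible characters, so a letter followed by combining marks (for example Hebrew vowel points) spans more than one index.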

Nishant answered Sep 21 '22 00:09