Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

regular expression and unicode utf-8 in python?

I have block of code:( Django code )

        list_temp = []
        tagname_re = re.compile(r'^[\w+\.-]+$', re.UNICODE)
        for key,tag in list.items():
            if len(tag) > settings.FORM_MAX_LENGTH_OF_TAG or len(tag) < settings.FORM_MIN_LENGTH_OF_TAG:
                raise forms.ValidationError(_('please use between %(min)s and %(max)s characters in you tags') % { 'min': settings.FORM_MIN_LENGTH_OF_TAG, 'max': settings.FORM_MAX_LENGTH_OF_TAG})
            if not tagname_re.match(tag):
                raise forms.ValidationError(_('please use following characters in tags: letters , numbers, and characters \'.-_\''))
            # only keep one same tag
            if tag not in list_temp and len(tag.strip()) > 0:
                list_temp.append(tag)

This allow me to put the tag name in Unicode character.

But I don't know why with my Unicode (khmer uncode Khmer Symbols Range: 19E0–19FF The Unicode Standard, Version 4.0).I could not .

My question :

How can I change the above codetagname_re = re.compile(r'^[\w+\.-]+$', re.UNICODE) to adapt my Unicode character.?Because if I input the tag with the "នយោបាយ" I got the message?

please use following characters in tags: letters , numbers, and characters \'.-_\''

like image 390
kn3l Avatar asked Mar 26 '11 08:03

kn3l


People also ask

Does regex work with Unicode?

This will make your regular expressions work with all Unicode regex engines. In addition to the standard notation, \p{L}, Java, Perl, PCRE, the JGsoft engine, and XRegExp 3 allow you to use the shorthand \pL. The shorthand only works with single-letter Unicode properties.

What does encoding =' UTF-8 do in Python?

UTF-8 is a byte oriented encoding. The encoding specifies that each character is represented by a specific sequence of one or more bytes.

How do I encode UTF-8 in Python?

Use str.Call str. encode() to encode str as UTF-8 bytes. Call bytes. decode() to decode UTF-8 encoded bytes to a Unicode string.

What is Unicode encoding in Python?

Since Python 3.0, strings are stored as Unicode, i.e. each character in the string is represented by a code point. So, each string is just a sequence of Unicode code points. For efficient storage of these strings, the sequence of code points is converted into a set of bytes. The process is known as encoding.


1 Answers

ោ (U+17C4 KHMER VOWEL SIGN OO) and ា (U+17B6 KHMER VOWEL SIGN AA) are not letters, they're combining marks, so they don't match \w.

like image 76
bobince Avatar answered Sep 22 '22 01:09

bobince