How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?

I'm using Python and Django, but I'm having a problem caused by a limitation of MySQL. According to the MySQL 5.1 documentation, its utf8 implementation does not support 4-byte characters. MySQL 5.5 will support 4-byte characters through utf8mb4, and someday in the future utf8 might support them as well.

But my server is not ready to upgrade to MySQL 5.5, and thus I'm limited to UTF-8 characters that take 3 bytes or less.

My question is: how can I filter (or replace) unicode characters that would take more than 3 bytes?

I want to replace all 4-byte characters with the official \ufffd (U+FFFD REPLACEMENT CHARACTER), or with ?.

In other words, I want behavior quite similar to Python's own str.encode() method (when passing the 'replace' parameter). Edit: I want behavior similar to encode(), but I don't want to actually encode the string. I want to still have a unicode string after filtering.

I DON'T want to escape the character before storing it in MySQL, because that would mean I would need to unescape every string I get from the database, which is very annoying and impractical.

[EDIT] Added tests about the proposed solutions

I have received good answers so far. Thanks, people! To choose one of them, I ran a quick test to find the simplest and fastest one.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # vi:ts=4 sw=4 et

    import cProfile
    import random
    import re

    # How many times to repeat each filtering
    repeat_count = 256

    # Percentage of "normal" chars, when compared to "large" unicode chars
    normal_chars = 90

    # Total number of characters in this string
    string_size = 8 * 1024

    # Generating a random testing string
    test_string = u''.join(
            unichr(random.randrange(32,
                0x10ffff if random.randrange(100) > normal_chars else 0x0fff
            )) for i in xrange(string_size) )

    # RegEx to find invalid characters
    re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

    def filter_using_re(unicode_string):
        return re_pattern.sub(u'\uFFFD', unicode_string)

    def filter_using_python(unicode_string):
        return u''.join(
            uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
            for uc in unicode_string
        )

    def repeat_test(func, unicode_string):
        for i in xrange(repeat_count):
            tmp = func(unicode_string)

    print '='*10 + ' filter_using_re() ' + '='*10
    cProfile.run('repeat_test(filter_using_re, test_string)')
    print '='*10 + ' filter_using_python() ' + '='*10
    cProfile.run('repeat_test(filter_using_python, test_string)')

    #print test_string.encode('utf8')
    #print filter_using_re(test_string).encode('utf8')
    #print filter_using_python(test_string).encode('utf8')

The results:

filter_using_re() did 515 function calls in 0.139 CPU seconds (0.138 CPU seconds at the sub() built-in)
filter_using_python() did 2097923 function calls in 3.413 CPU seconds (1.511 CPU seconds at the join() call and 1.900 CPU seconds evaluating the generator expression)

I did not test the itertools approach because that solution, although interesting, was quite large and complex.

Conclusion

The RegEx solution was, by far, the fastest one.
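For readers on Python 3, the comparison above can be approximated with the sketch below. It is not the original test script: it assumes Python 3.3 or later (where a str is a sequence of code points), uses timeit instead of cProfile, and picks an arbitrary seed and run count. The random "large" characters are drawn only from above U+FFFF, which simplifies the original generator. Timings depend on the machine and interpreter, so none are shown.

```python
import random
import re
import timeit

random.seed(0)  # arbitrary seed so that runs are repeatable

def random_char():
    # Roughly 9% characters above U+FFFF, the rest from U+0020-U+0FFE.
    if random.randrange(100) > 90:
        return chr(random.randrange(0x10000, 0x110000))
    return chr(random.randrange(32, 0x0fff))

test_string = ''.join(random_char() for _ in range(8 * 1024))

# Same character class as in the question, written for Python 3.
re_pattern = re.compile('[^\u0000-\uD7FF\uE000-\uFFFF]')

def filter_using_re(text):
    return re_pattern.sub('\ufffd', text)

def filter_using_python(text):
    return ''.join(
        c if c < '\ud800' or '\ue000' <= c <= '\uffff' else '\ufffd'
        for c in text
    )

# Both filters should agree before their timings are compared.
assert filter_using_re(test_string) == filter_using_python(test_string)

for func in (filter_using_re, filter_using_python):
    seconds = timeit.timeit(lambda: func(test_string), number=20)
    print('%s: %.4f s for 20 runs' % (func.__name__, seconds))
```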


asked Jul 10 '10 16:07

Denilson Sá Maia

2 Answers

Unicode characters in the ranges \u0000-\uD7FF and \uE000-\uFFFF have encodings of 3 bytes or fewer in UTF-8. The \uD800-\uDFFF range is reserved for UTF-16 surrogate pairs. I do not know Python, but you should be able to set up a regular expression that matches characters outside those ranges.

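    # Matches a surrogate code unit plus the character that follows it
    pattern = re.compile(u"[\uD800-\uDFFF].", re.UNICODE)
    # Matches anything outside U+0000-U+FFFF
    pattern = re.compile(u"[^\u0000-\uFFFF]", re.UNICODE)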

Edit adding Python from Denilson Sá's script in the question body:

    re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
    filtered_string = re_pattern.sub(u'\uFFFD', unicode_string)
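As a rough illustration of how that pattern behaves on Python 3 (an assumption, since the thread targets Python 2), here is a minimal sketch. The names RE_NOT_3BYTE and replace_4byte_chars are made up for this example. Note that the character class also matches lone surrogates, which cannot be encoded in UTF-8 at all.

```python
import re

# Matches anything outside U+0000-U+D7FF and U+E000-U+FFFF: code points
# above U+FFFF (4 bytes in UTF-8) and lone surrogates.
RE_NOT_3BYTE = re.compile('[^\u0000-\uD7FF\uE000-\uFFFF]')

def replace_4byte_chars(text, replacement='\ufffd'):
    """Return text with every matching character replaced."""
    return RE_NOT_3BYTE.sub(replacement, text)

sample = 'caf\u00e9 \u20ac \U0001F600!'
cleaned = replace_4byte_chars(sample)

# Every remaining character now fits in at most 3 UTF-8 bytes.
print(max(len(c.encode('utf-8')) for c in cleaned))
```

Passing '?' as the second argument gives the other replacement mentioned in the question.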


answered Sep 27 '22 21:09

drawnonward

If your data is already a UTF-8 encoded byte string (an 8-bit string), you may skip the decoding and encoding steps and directly inspect the first byte of each character. According to UTF-8:

    # 1-byte characters have the following format: 0xxxxxxx
    # 2-byte characters have the following format: 110xxxxx 10xxxxxx
    # 3-byte characters have the following format: 1110xxxx 10xxxxxx 10xxxxxx
    # 4-byte characters have the following format: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
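The layout above can be checked quickly on Python 3 (an assumption; the thread itself uses Python 2). The sketch below uses one arbitrarily chosen sample character per length class. The helper utf8_length_from_lead_byte is a hypothetical name introduced for this illustration.

```python
# One sample character per UTF-8 length class (chosen for illustration).
samples = ['A', '\u00e9', '\u20ac', '\U0001F600']

def utf8_length_from_lead_byte(lead):
    """Return the sequence length implied by a UTF-8 lead byte (int 0-255)."""
    if lead < 0x80:           # 0xxxxxxx
        return 1
    if 0xC0 <= lead < 0xE0:   # 110xxxxx
        return 2
    if 0xE0 <= lead < 0xF0:   # 1110xxxx
        return 3
    if 0xF0 <= lead < 0xF8:   # 11110xxx
        return 4
    raise ValueError('not a valid UTF-8 lead byte: %#x' % lead)

for ch in samples:
    encoded = ch.encode('utf-8')
    # Code point, lead byte in binary, actual length, length implied by lead byte
    print('U+%04X' % ord(ch), format(encoded[0], '08b'),
          len(encoded), utf8_length_from_lead_byte(encoded[0]))
```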

Given that layout, you only need to check the first byte of each character to filter out 4-byte characters:

    def filter_4byte_chars(s):
        i = 0
        j = len(s)
        # you need to convert
        # the immutable string
        # to a mutable list first
        s = list(s)
        while i < j:
            # get the value of this byte
            k = ord(s[i])
            # this is a 1-byte character, skip to the next byte
            if k <= 127:
                i += 1
            # this is a 2-byte character, skip ahead by 2 bytes
            elif k < 224:
                i += 2
            # this is a 3-byte character, skip ahead by 3 bytes
            elif k < 240:
                i += 3
            # this is a 4-byte character, remove it and update
            # the length of the string we need to check
            else:
                s[i:i+4] = []
                j -= 4
        return ''.join(s)

Skipping the decoding and encoding steps will save you some time, and for smaller strings that mostly contain 1-byte characters this could even be faster than the regular expression filtering.
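The function above relies on Python 2, where a str is a byte string. On Python 3 the same idea might look like the sketch below. It works on a bytes object, assumes the input is valid UTF-8 (malformed input is not detected), and builds a new buffer instead of deleting slices in place. The name filter_4byte_sequences and the optional replacement argument are additions for illustration, not part of the answer.

```python
def filter_4byte_sequences(data, replacement=b''):
    """Drop (or replace) 4-byte UTF-8 sequences in a bytes object.

    Assumes data is valid UTF-8; malformed input is not detected.
    """
    out = bytearray()
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]  # indexing bytes yields an int on Python 3
        if lead < 0x80:
            size = 1    # 0xxxxxxx
        elif lead < 0xE0:
            size = 2    # 110xxxxx
        elif lead < 0xF0:
            size = 3    # 1110xxxx
        else:
            size = 4    # 11110xxx
        if size == 4:
            out += replacement
        else:
            out += data[i:i + size]
        i += size
    return bytes(out)

raw = 'a\u00e9\u20ac\U0001F600z'.encode('utf-8')
print(filter_4byte_sequences(raw).decode('utf-8'))
```

To substitute U+FFFD rather than drop the character, pass its UTF-8 encoding as the replacement, for example `'\ufffd'.encode('utf-8')`.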

answered Sep 27 '22 21:09

kasioumis

Tags: python, mysql, unicode, django