Python: find a series of Chinese characters within a string and apply a function

Tags: python, regex

I have a series of strings that are mostly English but contain some phrases in Chinese characters. Here are two examples:

s1 = "You say: 你好. I say: 再見"
s2 = "答案, my friend, 在風在吹"

I'm trying to find each block of Chinese, apply a function that translates the text (I already have a way to do the translation), then substitute the translation back into the string. The output would be something like this:

o1 = "You say: hello. I say: goodbye"
o2 = "The answer, my friend, is blowing in the wind"

I can find the Chinese characters easily by doing this:

utf_line = s1.decode('utf-8') 
re.findall(ur'[\u4e00-\u9fff]+',utf_line)

But I end up with a list of all the Chinese characters and no way of determining where each phrase begins and ends in the original string.

cyril asked Dec 18 '22 12:12


2 Answers

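You can replace each match in place with re.sub() in Python. Pass a function as the replacement so that translate() receives each matched block (here translate() is assumed to take a string and return a string).

Try this:

print(re.sub(ur'[\u4e00-\u9fff]+', lambda m: translate(m.group(0)), utf_line))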
Ashan answered Dec 21 '22 02:12


re.findall() returns only the matched text, not its position. You could use re.finditer() instead and read m.group(), m.start() and m.end() from each match object.
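As a rough illustration of the re.finditer() approach, here is a minimal sketch that collects the position and text of each block. It assumes Python 3, where str is already Unicode, so no decode step is needed. The names CJK_RUN and find_blocks are made up for this example.

```python
import re

# One or more consecutive characters in the CJK Unified Ideographs range
# (the same range used in the question).
CJK_RUN = re.compile(r'[\u4e00-\u9fff]+')

def find_blocks(text):
    """Return a (start, end, block) tuple for each run of Chinese characters."""
    return [(m.start(), m.end(), m.group()) for m in CJK_RUN.finditer(text)]

blocks = find_blocks("You say: 你好. I say: 再見")
for start, end, block in blocks:
    # start is inclusive, end is exclusive, so text[start:end] == block
    print(start, end, block)
```

With the offsets in hand you could rebuild the string yourself, but that is exactly the bookkeeping re.sub() does for you.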

However, for your particular case, it seems more practical to call a function using re.sub().

From the re.sub() documentation: "If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string."

Code:

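# -*- coding: utf-8 -*-
import re

s = "You say: 你好. I say: 再見. 答案, my friend, 在風在吹"
utf_line = s.decode('utf-8')

# Example lookup table (named so that it does not shadow the built-in dict)
translations = {"你好" : "hello",
                "再見" : "goodbye",
                "答案" : "The answer",
                "在風在吹" : "is blowing in the wind",
               }

def translate(m):
    # m is the match object for one contiguous block of Chinese characters
    block = m.group().encode('utf-8')
    # Do your translation here

    # this is just an example
    if block in translations:
        return translations[ block ]
    else:
        return "{unknown}"


# Pass flags by keyword: the fourth positional argument of re.sub() is count
utf_translated = re.sub(ur'[\u4e00-\u9fff]+', translate, utf_line, flags=re.UNICODE)

print utf_translated.encode('utf-8')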

Output:

You say: hello. I say: goodbye. The answer, my friend, is blowing in the wind
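The code above targets Python 2. For reference, here is a sketch of how the same idea might look in Python 3, where str is already Unicode and the decode/encode steps drop out. TRANSLATIONS and translate_blocks are placeholder names for this example, and the lookup table merely stands in for whatever real translation step you have.

```python
import re

# Hypothetical lookup table standing in for a real translation step.
TRANSLATIONS = {
    "你好": "hello",
    "再見": "goodbye",
    "答案": "The answer",
    "在風在吹": "is blowing in the wind",
}

def translate(m):
    """Return the replacement for one matched block of Chinese characters."""
    return TRANSLATIONS.get(m.group(), "{unknown}")

def translate_blocks(text):
    # re.sub() calls translate() once for every non-overlapping match
    # and splices the returned string in place of that match.
    return re.sub(r'[\u4e00-\u9fff]+', translate, text)

result = translate_blocks("You say: 你好. I say: 再見. 答案, my friend, 在風在吹")
print(result)
```

Blocks missing from the table come back as "{unknown}", mirroring the fallback in the Python 2 version.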
Mariano answered Dec 21 '22 02:12