I've got a series of strings that are mostly English but contain some phrases in Chinese characters. Here are two examples:
s1 = "You say: 你好. I say: 再見"
s2 = "答案, my friend, 在風在吹"
I'm trying to find each block of Chinese, apply a function that translates the text (I already have a way to do the translation), then put the translation back into the string in place of the original block. So the output would be something like this:
o1 = "You say: hello. I say: goodbye"
o2 = "The answer, my friend, is blowing in the wind"
I can find the Chinese characters easily by doing this:
utf_line = s1.decode('utf-8')
re.findall(ur'[\u4e00-\u9fff]+',utf_line)
...But I end up with a list of all the Chinese characters and no way of determining where each phrase begins and ends.
You can replace each match of the regular expression in place by using re.sub() in Python.
Try this:
print(re.sub(ur'[\u4e00-\u9fff]+', lambda m: translate(m.group(0)), utf_line))
You can't get the indexes using re.findall(). You could use re.finditer() instead, and refer to m.group(), m.start() and m.end().
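As a minimal sketch of the re.finditer() approach, here it is in Python 3 syntax, where str is already Unicode, so the decode step from the question is not needed. The names CJK_BLOCK and spans are illustrative only, and the character range is the one used in the question.

```python
import re

# Assumption: the same character range as in the question.
CJK_BLOCK = re.compile(r'[\u4e00-\u9fff]+')

s1 = "You say: 你好. I say: 再見"

# Each match object carries the matched text and its position in s1.
spans = [(m.group(), m.start(), m.end()) for m in CJK_BLOCK.finditer(s1)]

for text, start, end in spans:
    # s1[start:end] is exactly the matched block
    print(text, start, end)
```

With the start and end offsets you could rebuild the string by hand, slicing around each block.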
However, for your particular case, it seems more practical to call a function using re.sub().
The Python documentation for re.sub() says: "If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string."
Code:
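# -*- coding: utf-8 -*-
import re

s = "You say: 你好. I say: 再見. 答案, my friend, 在風在吹"
utf_line = s.decode('utf-8')

# Example lookup table; in Python 2 these keys are UTF-8 encoded byte strings
translations = {"你好" : "hello",
                "再見" : "goodbye",
                "答案" : "The answer",
                "在風在吹" : "is blowing in the wind",
               }

def translate(m):
    # m is the match object for one block of Chinese characters
    block = m.group().encode('utf-8')
    # Do your translation here
    # this is just an example
    if block in translations:
        return translations[block]
    else:
        return "{unknown}"

# Pass flags by keyword: the fourth positional argument of re.sub() is count
utf_translated = re.sub(ur'[\u4e00-\u9fff]+', translate, utf_line, flags=re.UNICODE)
print utf_translated.encode('utf-8')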
Output:
You say: hello. I say: goodbye. The answer, my friend, is blowing in the wind
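For readers on Python 3, the same idea might look like the sketch below. It assumes nothing beyond the standard re module. TRANSLATIONS and translate_blocks are hypothetical names, and the lookup table only stands in for whatever translation function you actually have. Because Python 3 strings are already Unicode, the decode and encode steps are not needed.

```python
import re

# Hypothetical lookup table standing in for a real translation function.
TRANSLATIONS = {
    "你好": "hello",
    "再見": "goodbye",
    "答案": "The answer",
    "在風在吹": "is blowing in the wind",
}

def translate(match):
    """Return the replacement for one matched block of Chinese text."""
    return TRANSLATIONS.get(match.group(), "{unknown}")

def translate_blocks(text):
    # re.sub() calls translate() once per non-overlapping match
    # and splices the returned string into the result.
    return re.sub(r'[\u4e00-\u9fff]+', translate, text)

result = translate_blocks("You say: 你好. I say: 再見. 答案, my friend, 在風在吹")
print(result)
```

Blocks that are missing from the table come back as "{unknown}", matching the fallback in the code above.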