I want to remove part of the string below (the first 加護亜依, just before "Kago Ai"). The string is stored in oldString:
[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY
I'm using the following regex in Python:
p=re.compile(ur"( [\W]+) (?=[A-Za-z ]+–)", re.UNICODE)
newString=p.sub("", oldString)
When I output newString, nothing has been removed.
Character encodings. There are several standard methods to encode Japanese characters for use on a computer, including JIS, Shift-JIS, EUC, and Unicode.
The default encoding for Python 2 files is ASCII, so by declaring an encoding you make it possible to use Japanese directly. I encoded 'ル' to UTF-16 little-endian because that's the default Windows NTFS filename encoding.
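As a small illustration of the encoding step described above, here is a sketch that encodes the same katakana character with the built-in encode method. The variable names are made up for this example, and the character is written as a \u escape so the snippet does not depend on how the file itself is saved.

```python
# -*- coding: utf-8 -*-
# Sketch only: `ru`, `utf16_le` and `utf8` are illustrative names.
ru = u'\u30eb'  # KATAKANA LETTER RU, the character 'ル'

# UTF-16 little-endian: one 16-bit unit, low byte first
utf16_le = ru.encode('utf-16-le')

# UTF-8: the same character takes three bytes
utf8 = ru.encode('utf-8')
```

The same text therefore has different byte representations depending on the encoding, which is why the encoding has to be stated explicitly somewhere.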
The difference between \s and \s+: the expression X+ matches one or more X characters. Therefore, the regular expression \s matches a single whitespace character, while \s+ matches one or more whitespace characters.
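A quick way to see the difference is to substitute with each pattern on the same input. This is only a sketch; the sample string and variable names are invented for the illustration.

```python
import re

# Two spaces between "a" and "b", one tab between "b" and "c"
sample = 'a  b\tc'

# \s replaces every whitespace character on its own
single = re.sub(r'\s', '-', sample)

# \s+ replaces each run of whitespace with a single replacement
runs = re.sub(r'\s+', '-', sample)
```

Here single ends up with two dashes where the two spaces were, while runs has just one.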
You can use the following snippet to solve the issue:
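#!/usr/bin/python
# -*- coding: utf-8 -*-
import re

# Input string as a Unicode literal (named `text` so the built-in `str` is not shadowed)
text = u'[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY'
# A run of Japanese characters plus one space, only when followed by Latin text and the dash
regex = u'[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]+ (?=[A-Za-z ]+–)'
p = re.compile(regex, re.U)
match = p.sub("", text)
# Python 2: encode the Unicode result explicitly before printing
print match.encode("UTF-8")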
See IDEONE demo
Besides the # -*- coding: utf-8 -*- declaration, I have added @nhahtdh's character class to detect Japanese symbols.
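A likely reason the pattern in the question removed nothing (this is my reading, it is not spelled out in the question itself) is that with re.UNICODE the kanji count as word characters, so [\W]+ can never match them. The sketch below checks that, and shows that an explicit range does match. The variable names are only for illustration.

```python
# -*- coding: utf-8 -*-
import re

kanji = u'加護亜依'

# With re.UNICODE, kanji are word characters, so \W finds nothing here
non_word = re.search(r'\W', kanji, re.UNICODE)

# ...while \w+ matches the whole run
word_run = re.search(r'\w+', kanji, re.UNICODE)

# An explicit CJK range (one of the ranges in the character class above) matches too
explicit = re.search(u'[\u4e00-\u9faf]+', kanji)
```

non_word is None, which is why an explicit character class is used instead of \W.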
Note that the match needs to be encoded as a UTF-8 string "manually", since Python 2 needs to be "reminded" that we are working with Unicode all the time.
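For anyone on Python 3, where str is already Unicode, the same idea can be sketched without the manual encode step. The function name strip_japanese_name and the constant names are hypothetical, chosen for this example; the character ranges and the lookahead are the ones from the snippet above.

```python
# -*- coding: utf-8 -*-
import re

# Same Japanese character ranges as in the snippet above
JAPANESE_RUN = u'[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]+'

# Japanese run plus one space, only if Latin text and the dash follow
PATTERN = re.compile(JAPANESE_RUN + u' (?=[A-Za-z ]+–)', re.U)


def strip_japanese_name(title):
    """Remove a Japanese run that directly precedes 'Latin text –'."""
    return PATTERN.sub(u'', title)


old_string = u'[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY'
new_string = strip_japanese_name(old_string)
```

Only the first 加護亜依 is removed, because the second one is followed by "vs." rather than by Latin text and the dash. In Python 3 the result can normally be passed to print() directly, assuming the terminal encoding can represent the characters.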