Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex to match capital/special/unicode/vietnamese characters

Tags:

python

regex

I'm facing an issue. Indeed, I work with vietnamese texts and I want to find every word containing uppercase(s) (capital letter). When I use the 're' module, my function (temp) does not catch word like "Đà". The other way (temp2) is to check each character at a time, it works but it is slow since I have to split the sentences into words.

Hence I would like to know if there is a way of the "re" module to catch all the special capital letter.

I have 2 ways :

def temp(sentence):
    return re.findall(r'[a-z]*[A-Z]+[a-z]*', sentence)


lis=word_tokenize(sentence)
def temp2(lis):
    proper_noun=[]
    for word in lis:
        for letter in word:
            if letter.isupper():
                proper_noun.append(word)
                break
    return proper_noun

Input:

'nous avons 2 Đồng et 3 Euro'

Expected output :

['Đồng','Euro']

Thank you!

like image 303
huseyin39 Avatar asked Jun 26 '18 04:06

huseyin39


People also ask

How to match capital letters in regex?

Using character sets For example, the regular expression "[ A-Za-z] " specifies to match any single uppercase or lowercase letter. In the character set, a hyphen indicates a range of characters, for example [A-Z] will match any one capital letter. In a character set a ^ character negates the following characters.

Does regex work with Unicode?

This will make your regular expressions work with all Unicode regex engines. In addition to the standard notation, \p{L}, Java, Perl, PCRE, the JGsoft engine, and XRegExp 3 allow you to use the shorthand \pL. The shorthand only works with single-letter Unicode properties.

What is S and W in regex?

On the other hand, the \S+ (uppercase S ) matches anything that is NOT matched by \s , i.e., non-whitespace. In regex, the uppercase metacharacter denotes the inverse of the lowercase counterpart, for example, \w for word character and \W for non-word character; \d for digit and \D or non-digit.


1 Answers

You may use this regex:

\b\S*[AĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊOÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴAĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊOÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴAĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊOÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴAĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊOÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴAĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊOÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴAĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊOÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴA-Z]+\S*\b

Regex Demo

like image 52
Rizwan M.Tuman Avatar answered Sep 24 '22 15:09

Rizwan M.Tuman