
Python 2.7: regex - match any letter from any language

Tags: python, regex

I am trying to check whether a string contains only letters (from any language) in Python 2.7. I have tried this code:

# -*- coding: utf-8 -*-
import re

def main():
    regexp1 = re.compile('[^\W\d_]+', re.IGNORECASE | re.UNICODE)
    regexp2 = re.compile('[\p{L}]+', re.IGNORECASE | re.UNICODE)

    print("1", regexp1.search(u"test"))
    print("2", regexp1.search(u'äö'))
    print("3", regexp1.search(u'...'))
    print("4", regexp1.search(u'9a'))
    print("5", regexp1.search(u'New / York'))

    print("6", regexp2.search(u"test"))
    print("7", regexp2.search(u'äö'))
    print("8", regexp2.search(u'...'))
    print("9", regexp2.search(u'9a'))
    print("10", regexp2.search(u'New / York'))

if __name__ == '__main__':
    main()

Output:

('1', <_sre.SRE_Match object at 0x02ACF678>)
('2', <_sre.SRE_Match object at 0x02ACF678>)
('3', None)
('4', <_sre.SRE_Match object at 0x02ACF678>)
('5', <_sre.SRE_Match object at 0x02ACF678>)
('6', None)
('7', None)
('8', None)
('9', None)
('10', None)

I want a regex that matches only strings №1 and №2 (strings consisting solely of letters, from any language). As written, regexp1 also matches strings that merely contain letters alongside digits or /.

I have also tried \p{L}, but it does not work at all. I tried these patterns: [\p{L}]+, (\p{L})+, \p{L}.

Gooman asked Aug 31 '25 18:08


2 Answers

regexp1 is a good start. The problem is that regexp1 matches strings that contain at least one letter, not strings that contain only letters. Try this:

regexp1 = re.compile(r'^[^\W\d_]+$', re.IGNORECASE | re.UNICODE)

This "anchors" the match both to the beginning and to the end of the string, meaning that it won't be able to just match the "New" part of "New / York".
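As a minimal sketch of the anchored pattern, here it is applied to the five strings from the question. The helper name is_only_letters and the results list are made up for illustration, and the code is written so that it should run on both Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
import re

# [^\W\d_] means "a word character that is neither a digit nor an underscore".
# ^ and $ force the whole string to consist of such characters.
ONLY_LETTERS = re.compile(r'^[^\W\d_]+$', re.UNICODE)


def is_only_letters(text):
    """Hypothetical helper: True if the anchored pattern matches text."""
    return ONLY_LETTERS.match(text) is not None


samples = [u"test", u"äö", u"...", u"9a", u"New / York"]
results = [is_only_letters(s) for s in samples]
print(results)  # [True, True, False, False, False]
```

Only the first two strings pass, which is the behaviour the question asks for.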

Python's re module does not support Unicode property classes like \p{L}, but there is a third-party regex module that does. See the docs at https://pypi.python.org/pypi/regex/. However, I can't speak to the performance or standards-compliance of that module.

Dan answered Sep 02 '25 08:09


The third-party regex module is recommended in the re docs for more functionality and better Unicode support. Particularly, it supports \p patterns, so

\p{L}+

should work fine with patterns handled by the regex module, matching any sequence of Unicode letter characters.
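A rough sketch of how that could look, assuming the third-party regex package is installed (it is not part of the standard library). The helper name is_only_letters is hypothetical, the import guard is only there so the snippet does not crash when the package is missing, and regex.UNICODE is passed explicitly as a precaution for Python 2:

```python
# -*- coding: utf-8 -*-
try:
    import regex  # third-party package: pip install regex
except ImportError:
    regex = None  # the demo below is skipped if the package is missing


def is_only_letters(text):
    """Hypothetical helper: True if every character of text is in \\p{L}."""
    # regex.fullmatch requires the pattern to cover the entire string.
    return regex.fullmatch(r'\p{L}+', text, flags=regex.UNICODE) is not None


if regex is not None:
    for sample in [u"test", u"äö", u"...", u"9a", u"New / York"]:
        print(repr(sample))
        print(is_only_letters(sample))
```

With the package available, only the first two samples should report True.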

However, you should be cautious: a combining diacritic is not a letter. You can alter your regex to accept combining marks, or normalize your input to NFC form so that some combining marks are merged into the preceding letter, but first, you should think very carefully about your definition of "contains only letters".
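To illustrate the point about combining marks, here is a small sketch using only the standard library. It uses the [^\W\d_] class from the question rather than \p{L}, so that it runs without the regex module; the variable names are made up for the example:

```python
# -*- coding: utf-8 -*-
import re
import unicodedata

ONLY_LETTERS = re.compile(r'^[^\W\d_]+$', re.UNICODE)

# 'a' followed by U+0308 COMBINING DIAERESIS: renders like 'ä' but is
# two code points, and the second one is a mark, not a letter.
decomposed = u'a\u0308'

# NFC normalization merges the pair into the single code point U+00E4.
composed = unicodedata.normalize('NFC', decomposed)

print(len(decomposed))                              # 2
print(len(composed))                                # 1
print(ONLY_LETTERS.match(decomposed) is not None)   # False
print(ONLY_LETTERS.match(composed) is not None)     # True
```

Normalization only helps when a precomposed character exists, so some letter and mark sequences will still fail the check after NFC.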

Also, search only checks whether the string contains a match for the regex, not whether the entire string matches the regex. I would recommend fullmatch for matching the entire string, but that's only in Python 3.4+. For 2.7, I would say to anchor the regex:

^\p{L}+$

except that $ can match right before a trailing newline, so you should still examine the match object to see if it represents a whole-string match or if it stops before a trailing newline.
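The trailing-newline caveat can be seen with the standard re module as well. The sketch below again substitutes the [^\W\d_] class for \p{L} so that it runs without the regex module, and shows two possible workarounds: comparing the end of the match with the length of the string, or using \Z, which in Python's re matches only at the very end of the string. The variable names are illustrative:

```python
# -*- coding: utf-8 -*-
import re

with_dollar = re.compile(r'^[^\W\d_]+$', re.UNICODE)
with_z = re.compile(r'^[^\W\d_]+\Z', re.UNICODE)

text = u"test\n"

# $ is satisfied just before the trailing newline, so this still matches...
m = with_dollar.match(text)
print(m is not None)             # True
# ...but the match stops short of the end of the string.
print(m.end() == len(text))      # False

# \Z only matches at the true end of the string, so the newline is rejected.
print(with_z.match(text) is not None)     # False
print(with_z.match(u"test") is not None)  # True
```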

user2357112 supports Monica answered Sep 02 '25 07:09