How can I check if a Python unicode string contains non-Western letters?

Tags: python, unicode, django

I have a Python Unicode string. I want to make sure it only contains letters from the Roman alphabet (A through Z), as well as letters commonly found in European alphabets, such as ß, ü, ø, é, à, and î. It should not contain characters from other alphabets (Chinese, Japanese, Korean, Arabic, Cyrillic, Hebrew, etc.). What's the best way to go about doing this?

Currently I am using this bit of code, but I don't know if it's the best way:

def only_roman_chars(s):
    try:
        s.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False

(I am using Python 2.5. I am also doing this in Django, so if the Django framework happens to have a way to handle such strings, I can use that functionality -- I haven't come across anything like that, however.)

asked Jun 22 '10 15:06

mipadi

6 Answers

import unicodedata as ud

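latin_letters = {}  # cache: character -> True if its Unicode name contains 'LATIN'

def is_latin(uchr):
    try:
        return latin_letters[uchr]
    except KeyError:
        return latin_letters.setdefault(uchr, 'LATIN' in ud.name(uchr))

def only_roman_chars(unistr):
    # Only alphabetic characters are checked; digits, spaces and punctuation are skipped.
    return all(is_latin(uchr)
               for uchr in unistr
               if uchr.isalpha())  # isalpha suggested by John Machin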

>>> only_roman_chars(u"ελληνικά means greek")
False
>>> only_roman_chars(u"frappé")
True
>>> only_roman_chars(u"hôtel lœwe")
True
>>> only_roman_chars(u"123 ångstrom ð áß")
True
>>> only_roman_chars(u"russian: гага")
False

answered Sep 28 '22 11:09

tzot

The top answer to this by @tzot is great, but IMO there should really be a library for this that works for all scripts. So, I made one (heavily based on that answer).

pip install alphabet-detector

and then use it directly:

from alphabet_detector import AlphabetDetector
ad = AlphabetDetector()

ad.only_alphabet_chars(u"ελληνικά means greek", "LATIN") #False
ad.only_alphabet_chars(u"ελληνικά", "GREEK") #True
ad.only_alphabet_chars(u'سماوي يدور', 'ARABIC')
ad.only_alphabet_chars(u'שלום', 'HEBREW')
ad.only_alphabet_chars(u"frappé", "LATIN") #True
ad.only_alphabet_chars(u"hôtel lœwe 67", "LATIN") #True
ad.only_alphabet_chars(u"det forårsaker første", "LATIN") #True
ad.only_alphabet_chars(u"Cyrillic and кириллический", "LATIN") #False
ad.only_alphabet_chars(u"кириллический", "CYRILLIC") #True

Also, a few convenience methods for major languages:

ad.is_cyrillic(u"Поиск") #True
ad.is_latin(u"howdy") #True
ad.is_cjk(u"hi") #False
ad.is_cjk(u'汉字') #True

answered Sep 28 '22 12:09

Eli

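The printable constant in the standard string module contains the ASCII letters, digits, punctuation and whitespace. You can remove these characters from the text, and if anything is left, the text contains non-ASCII characters. Note that this also rejects accented Latin letters such as á and é, as the first example below shows. This is what I did: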

In [1]: from string import printable

In [2]: def is_latin(text):
   ...:     return not bool(set(text) - set(printable))
   ...:

In [3]: is_latin('Hradec Králové District,,Czech Republic,')
Out[3]: False

In [4]: is_latin('Hradec Krlov District,,Czech Republic,')
Out[4]: True

I have no way to test this against every non-Latin character; if anyone can do that, please let me know. Thanks.
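One way to get at the full set of Latin letters, rather than only the ASCII ones in string.printable, is to derive it from the Unicode character database that ships with Python. The following is a rough sketch for Python 3; the names build_latin_letters, LATIN_LETTERS and only_latin_letters are made up for illustration, and "Latin" here simply means "the official character name contains the word LATIN", which also admits things like fullwidth Latin letters.

```python
import sys
import unicodedata

def build_latin_letters():
    """Collect every alphabetic code point whose Unicode name contains the word LATIN."""
    letters = set()
    for code_point in range(sys.maxunicode + 1):
        char = chr(code_point)
        # name() is given a default so unnamed code points do not raise ValueError.
        if char.isalpha() and "LATIN" in unicodedata.name(char, "").split():
            letters.add(char)
    return letters

# Built once at import time; scanning the whole code space takes a moment.
LATIN_LETTERS = build_latin_letters()

def only_latin_letters(text):
    """True if every alphabetic character in text is a Latin letter.

    Digits, punctuation and whitespace are ignored rather than rejected.
    """
    return all(char in LATIN_LETTERS for char in text if char.isalpha())
```

With this set, 'Hradec Králové District,,Czech Republic,' passes because á and é are Latin letters, while a string containing Cyrillic or Greek letters does not.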

answered Sep 28 '22 11:09

Alexander Astashov

For what you say you want to do, your approach is about right. If you are running on Windows, I'd suggest using cp1252 instead of iso-8859-1. You might also allow cp1250 -- this would pick up eastern European countries like Poland, the Czech Republic, Slovakia, Romania, Slovenia, Hungary, Croatia, etc., where the alphabet is Latin-based. Other cp125x code pages would include Turkish and Maltese ...
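The idea of accepting more than one code page could be sketched roughly as follows. This is only an illustration, written for Python 3: the function name encodable_in_any and the default tuple of encodings are assumptions, and the whole string has to fit into a single code page for the check to pass.

```python
def encodable_in_any(text, encodings=("cp1252", "cp1250")):
    """Return True if text can be encoded completely in at least one code page."""
    for name in encodings:
        try:
            text.encode(name)
            return True
        except UnicodeEncodeError:
            # This code page lacks at least one character; try the next one.
            continue
    return False
```

A Polish name such as "Łódź" fails cp1252 but fits cp1250, so it is accepted. A string that mixes letters found only in different code pages is rejected, which may or may not be what you want.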

You may also like to consider transcription from Cyrillic to Latin; as far as I know there are several systems, one of which may be endorsed by the UPU (Universal Postal Union).

I'm a little intrigued by your comment "Our shipping department doesn't want to have to fill out labels with, e.g., Chinese addresses" ... three questions: (1) do you mean "addresses in country X" or "addresses written in X-ese characters" (2) wouldn't it be better for your system to print the labels? (3) how does the order get shipped if it fails your test?

answered Sep 28 '22 11:09

John Machin

Checking for ISO-8859-1 would miss reasonable Western characters like 'œ' and '€'. The solution depends on how you define "Western", and how you want to handle non-letters. Here's one approach:

import unicodedata

def is_permitted_char(char):
    cat = unicodedata.category(char)[0]
    if cat == 'L': # Letter
        return 'LATIN' in unicodedata.name(char, '').split()
    elif cat == 'N': # Number
        # Only DIGIT ZERO - DIGIT NINE are allowed
        return '0' <= char <= '9'
    elif cat in ('S', 'P', 'Z'): # Symbol, Punctuation, or Space
        return True
    else:
        return False

def is_valid(text):
    return all(is_permitted_char(c) for c in text)
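To see what the check above keys on, it can help to look at the general category and the official name that unicodedata reports for a few sample characters. The helper describe below is just an illustrative name, and the sketch assumes Python 3.

```python
import unicodedata

def describe(char):
    """Return (general category, Unicode name) for a single character."""
    # The empty-string default avoids a ValueError for unnamed code points.
    return unicodedata.category(char), unicodedata.name(char, "")

# Print one line per sample character: the character, its category, its name.
for char in "éœ€г٣":
    category, name = describe(char)
    print(repr(char), category, name)
```

Here 'é' and 'œ' are lowercase letters (Ll) with LATIN in their names, '€' is a currency symbol (Sc) and so passes as a symbol, 'г' is a letter whose name starts with CYRILLIC, and '٣' is a decimal digit (Nd) that falls outside '0' to '9'.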

answered Sep 28 '22 10:09

dan04

Check the code in django.template.defaultfilters.slugify:

import unicodedata
value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')

is what you are looking for. You can then compare the resulting string with the original.
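A sketch of how that comparison might look in Python 3, where encode returns bytes and has to be decoded again before comparing. The helper names strip_marks and only_ascii_base_letters are hypothetical and not part of Django. Comparing against the raw original would flag every accented letter, so this version compares against the decomposed text with its combining marks removed.

```python
import unicodedata

def strip_marks(value):
    """Decompose with NFKD and drop combining marks, e.g. 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def only_ascii_base_letters(value):
    """True if nothing is lost when the mark-stripped text is forced into ASCII."""
    stripped = strip_marks(value)
    return stripped.encode("ascii", "ignore").decode("ascii") == stripped
```

This accepts "frappé" and rejects Cyrillic text, but it also rejects letters that have no ASCII decomposition, such as ß, ø and œ, so it is stricter than what the question asks for.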

answered Sep 28 '22 12:09

Claude Vedovini
