Check if any (all) character of a string is in a given range

I have strings containing Unicode characters (Cyrillic):

myString1 = 'Австрия'
myString2 = 'AustriЯ'

I want to check if all the elements in the string are English (ASCII). Now I'm using a loop:

for char in myString1:
    # 65..90 are the code points of 'A'..'Z'
    if ord(char) not in range(65, 91):
        break

So I break out of the loop at the first non-English character. But as the second example shows, a string can contain many English characters followed by a Unicode one at the very end, in which case I end up checking the whole string. Furthermore, if the whole string is English, I still check every character.

Is there any more efficient way to do this? I'm thinking about something like:

if any(myString[:]) is not in range(65,91)

Mikhail_Sam asked Dec 26 '17 09:12


4 Answers

You can speed up the check by using a set (O(1) contains check), especially if you are checking multiple strings for the same range since the initial set creation requires one iteration as well. You can then use all for the early-breaking iteration pattern which fits better than any here:

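import string

# Note: the name `ascii` shadows the built-in ascii() function
ascii = set(string.ascii_uppercase)
ascii_all = set(string.ascii_uppercase + string.ascii_lowercase)

my_string1 = 'Австрия'  # first example string from the question

if all(x in ascii for x in my_string1):
    # my_string1 consists only of uppercase ASCII letters
    pass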

Of course, any all construct can be rewritten as an any construct via De Morgan's laws:

if not any(x not in ascii for x in my_string1):
    # my_string1 consists only of uppercase ASCII letters
    pass

Update:

One good, purely set-based approach not requiring a complete iteration, as pointed out by Artyer:

if ascii.issuperset(my_string1):
    # my_string1 consists only of uppercase ASCII letters
    pass
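
As a rough, self-contained sketch, here are the three variants above applied to the strings from the question. The helper names, the `LETTERS` constant and the extra `'Austria'` sample are made up for illustration, and "English" is assumed here to mean the 52 ASCII letters:

```python
import string

# Assumption: "English" means upper- and lowercase ASCII letters only
LETTERS = frozenset(string.ascii_letters)

def only_letters_all(text):
    # Stops at the first character that is not in LETTERS
    return all(ch in LETTERS for ch in text)

def only_letters_any(text):
    # De Morgan form of the check above
    return not any(ch not in LETTERS for ch in text)

def only_letters_superset(text):
    # issuperset accepts any iterable, including a str
    return LETTERS.issuperset(text)

samples = ['Австрия', 'AustriЯ', 'Austria']
results = {
    text: (only_letters_all(text),
           only_letters_any(text),
           only_letters_superset(text))
    for text in samples
}

for text in samples:
    print(text, results[text])
```

All three functions should agree on every input; note that each of them returns True for an empty string.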

user2390182 answered Nov 13 '22 04:11


Another way, just as @schwobaseggl suggests, but using only set methods:

import string
ascii = string.ascii_uppercase + string.ascii_lowercase
if set(my_string).issubset(ascii):
    # my_string contains only ASCII letters
    pass
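
One difference worth keeping in mind: `set(my_string)` has to consume the whole string before `issubset` runs, while an `all(...)` generator stops at the first mismatch. A small sketch that counts how many characters each form reads (the `counting` helper and the test string are hypothetical, purely for illustration):

```python
import string

LETTERS = set(string.ascii_letters)

def counting(text, counter):
    """Yield the characters of text, recording how many were consumed."""
    for ch in text:
        counter[0] += 1
        yield ch

# Worst case for the set-building form: the offending character comes first
text = 'Я' + 'a' * 1000

seen_all = [0]
result_all = all(ch in LETTERS for ch in counting(text, seen_all))

seen_set = [0]
result_set = set(counting(text, seen_set)).issubset(LETTERS)

print(result_all, seen_all[0])   # False 1
print(result_set, seen_set[0])   # False 1001
```

Both forms give the same answer; they differ only in how much of the input they touch, which is one plausible reason their timings differ between inputs.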

Netwave answered Nov 13 '22 03:11


There's no way to avoid iterating. However, you can certainly make it more efficient by doing not 65 <= ord(char) < 91 rather than comparing against a range. (Note the strict upper bound: range(65, 91) stops at 90, the code point of 'Z'.)
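
A minimal sketch of that idea, assuming the goal is the same uppercase-only check as in the question (both function names are invented for this example):

```python
def is_upper_ascii(text):
    # 65 is ord('A') and 90 is ord('Z'); the chained comparison
    # covers the same code points as range(65, 91)
    return all(65 <= ord(ch) <= 90 for ch in text)

def is_upper_ascii_range(text):
    # Same check written with a range, for comparison
    return all(ord(ch) in range(65, 91) for ch in text)

for sample in ('AUSTRIA', 'Austria', 'AUSTRIЯ'):
    print(sample, is_upper_ascii(sample), is_upper_ascii_range(sample))
```

The two functions should always return the same result; which one is faster on a given interpreter is best measured rather than assumed.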

Daniel Roseman answered Nov 13 '22 04:11


re appears to be quite fast:

import re

# to check whether any outside ranges (->MatchObject) / all in ranges (->None)
nonletter = re.compile('[^a-zA-Z]').search

# to check whether any in ranges (->MatchObject) / all outside ranges (->None)
letter = re.compile('[a-zA-Z]').search

bool(nonletter(myString1))
# True

bool(nonletter(myString2))
# True

bool(nonletter(myString2[:-1]))
# False
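
The same "all characters in range" test can also be phrased positively with `fullmatch`. This is only a sketch of an alternative spelling, not part of the benchmark, and the names `ONLY_LETTERS` and `only_letters` are invented here:

```python
import re

# Matches strings made up entirely of ASCII letters (including the empty string)
ONLY_LETTERS = re.compile(r'[a-zA-Z]*')

def only_letters(text):
    # fullmatch returns a match object only if the whole string matches
    return ONLY_LETTERS.fullmatch(text) is not None

print(only_letters('Австрия'))
print(only_letters('AustriЯ'))
print(only_letters('Austri'))
```

It should agree with `not nonletter(text)` from the snippet above for any input.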

Benchmarks for the OP's two examples and a positive one (set is @schwobaseggl's approach, setset is @DanielSanchez's):

Австрия
re               0.48832818 ± 0.09022105 µs
set              0.58745548 ± 0.01759877 µs
setset           0.81759223 ± 0.03595184 µs
AustriЯ
re               0.51960442 ± 0.01881561 µs
set              1.03043942 ± 0.02453405 µs
setset           0.54060076 ± 0.01505265 µs
tralala
re               0.27832978 ± 0.01462306 µs
set              0.88285526 ± 0.03792728 µs
setset           0.43238688 ± 0.01847240 µs

Benchmark code:

import types
from timeit import timeit
import re
import string
import numpy as np

def mnsd(trials):
    return '{:1.8f} \u00b1 {:10.8f} \u00b5s'.format(np.mean(trials), np.std(trials))

nonletter = re.compile('[^a-zA-Z]').search
letterset = set(string.ascii_letters)

def f_re(stri):
    return not nonletter(stri)

def f_set(stri):
    return all(x in letterset for x in stri)

def f_setset(stri):
    return set(stri).issubset(letterset)

for stri in ('Австрия', 'AustriЯ', 'tralala'):
    ref = f_re(stri)
    print(stri)
    for name, func in list(globals().items()):
        if not name.startswith('f_') or not isinstance(func, types.FunctionType):
            continue
        try:
            assert ref == func(stri)
            # timeit returns total seconds for 1000 calls; * 1000 converts that to µs per call
            print("{:16s}".format(name[2:]), mnsd([timeit(
                'f(stri)', globals={'f':func, 'stri':stri}, number=1000) * 1000 for i in range(1000)]))
        except Exception:
            print("{:16s} apparently failed".format(name[2:]))

Paul Panzer answered Nov 13 '22 04:11