I have a string containing Unicode symbols (Cyrillic):
myString1 = 'Австрия'
myString2 = 'AustriЯ'
I want to check whether all the characters in the string are English (ASCII). Currently I'm using a loop:
for char in myString1:
    if ord(char) not in range(65, 91):
        break
So I break out of the loop as soon as I find the first non-English character. But as the second example shows, a string can contain many English characters followed by Unicode ones at the end, in which case I check almost the whole string. Furthermore, if the entire string is English, I still check every character.
Is there any more efficient way to do this? I'm thinking about something like:
if any(ord(char) not in range(65, 91) for char in myString1):
You can speed up the check by using a set (O(1) contains check), especially if you are checking multiple strings for the same range, since the initial set creation requires one iteration as well. You can then use all() for the early-breaking iteration pattern, which fits better than any() here:
import string

# Note: the name ascii shadows the built-in ascii() function.
ascii = set(string.ascii_uppercase)
ascii_all = set(string.ascii_uppercase + string.ascii_lowercase)

if all(x in ascii for x in my_string1):
    pass  # my_string1 consists only of uppercase ASCII letters
Of course, any all() construct can be transformed into an any() via De Morgan's law:
if not any(x not in ascii for x in my_string1):
    pass  # every character of my_string1 is in the ascii set
One good, purely set-based approach that does not require a complete iteration, as pointed out by Artyer:
if ascii.issuperset(my_string1):
    pass  # every character of my_string1 is in the ascii set
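For reference, the three variants above can be collected into one self-contained sketch. The helper names and the LETTERS constant are illustrative and not part of the original answer, and the set here deliberately includes both upper- and lowercase letters, which is an assumption about what "English" should mean.

```python
import string

# Assumption: "English" means the 52 ASCII letters a-z and A-Z.
LETTERS = frozenset(string.ascii_letters)

def all_letters_all(text):
    # all() stops at the first character that is not in the set.
    return all(ch in LETTERS for ch in text)

def all_letters_any(text):
    # Equivalent form obtained via De Morgan's law.
    return not any(ch not in LETTERS for ch in text)

def all_letters_superset(text):
    # Pure set-based form; a str is accepted as an iterable of characters.
    return LETTERS.issuperset(text)

for sample in ('Австрия', 'AustriЯ', 'Austria'):
    print(sample,
          all_letters_all(sample),
          all_letters_any(sample),
          all_letters_superset(sample))
```

All three helpers return True for an empty string, so an explicit emptiness check may be needed if that case matters.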
Another way, just as @schwobaseggl suggests, but using set methods throughout:
import string

ascii = string.ascii_uppercase + string.ascii_lowercase
if set(my_string).issubset(ascii):
    pass  # my_string contains only ASCII letters
There's no way to avoid iterating. However, you can certainly make it more efficient by doing not 65 <= ord(s) <= 90 rather than comparing against a range. (The upper bound is 90 because range(65, 91) excludes 91.)
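A minimal sketch of that comparison-based check, wrapped in all() so that it still stops at the first failing character. The function names are hypothetical; 65 and 90 are the code points of 'A' and 'Z', matching range(65, 91). If the goal is any ASCII character rather than letters only, Python 3.7 and later also offer the built-in str.isascii().

```python
def is_upper_ascii(text):
    # Chained comparison: 65 is ord('A'), 90 is ord('Z').
    return all(65 <= ord(ch) <= 90 for ch in text)

def is_ascii_letters(text):
    # Comparing single characters directly avoids the ord() call.
    return all('a' <= ch <= 'z' or 'A' <= ch <= 'Z' for ch in text)

print(is_upper_ascii('AUSTRIA'))    # True
print(is_upper_ascii('Austria'))    # False, lowercase is outside 65..90
print(is_ascii_letters('AustriЯ'))  # False
print('Austria'.isascii())          # True (requires Python 3.7+)
```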
re appears to be quite fast:
import re
# to check whether any outside ranges (->MatchObject) / all in ranges (->None)
nonletter = re.compile('[^a-zA-Z]').search
# to check whether any in ranges (->MatchObject) / all outside ranges (->None)
letter = re.compile('[a-zA-Z]').search
bool(nonletter(myString1))
# True
bool(nonletter(myString2))
# True
bool(nonletter(myString2[:-1]))
# False
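If a direct yes/no answer is preferred over inverting a search result, the same test could also be phrased with fullmatch. This is only a sketch, and only_letters is a hypothetical name. Unlike the search-based check, the + quantifier makes it reject the empty string.

```python
import re

# Matches only if the whole string consists of one or more ASCII letters.
only_letters = re.compile(r'[a-zA-Z]+').fullmatch

for sample in ('Австрия', 'AustriЯ', 'Austri'):
    print(sample, bool(only_letters(sample)))
```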
Benchmarks for OP's two examples and a positive one (set is @schwobaseggl's approach, setset is @DanielSanchez's):
Австрия
  re      0.48832818 ± 0.09022105 µs
  set     0.58745548 ± 0.01759877 µs
  setset  0.81759223 ± 0.03595184 µs
AustriЯ
  re      0.51960442 ± 0.01881561 µs
  set     1.03043942 ± 0.02453405 µs
  setset  0.54060076 ± 0.01505265 µs
tralala
  re      0.27832978 ± 0.01462306 µs
  set     0.88285526 ± 0.03792728 µs
  setset  0.43238688 ± 0.01847240 µs
Benchmark code:
import types
from timeit import timeit
import re
import string
import numpy as np

def mnsd(trials):
    # Format the mean ± standard deviation of the timings (µs per call).
    return '{:1.8f} \u00b1 {:10.8f} \u00b5s'.format(np.mean(trials), np.std(trials))

nonletter = re.compile('[^a-zA-Z]').search
letterset = set(string.ascii_letters)

def f_re(stri):
    return not nonletter(stri)

def f_set(stri):
    return all(x in letterset for x in stri)

def f_setset(stri):
    return set(stri).issubset(letterset)

for stri in ('Австрия', 'AustriЯ', 'tralala'):
    ref = f_re(stri)
    print(stri)
    # Time every module-level function whose name starts with 'f_'.
    for name, func in list(globals().items()):
        if not name.startswith('f_') or not isinstance(func, types.FunctionType):
            continue
        try:
            assert ref == func(stri)
            # Seconds for 1000 calls, times 1000, gives µs per call.
            print("{:16s}".format(name[2:]), mnsd([timeit(
                'f(stri)', globals={'f':func, 'stri':stri}, number=1000) * 1000 for i in range(1000)]))
        except Exception:
            print("{:16s} apparently failed".format(name[2:]))