I know how to do this by iterating through all of the characters in the string, but I am looking for a more elegant method.


How do I verify that a string only contains letters, numbers, underscores and dashes?

2 Answers

A regular expression will do the trick with very little code:

    import re
    ...
    if re.match("^[A-Za-z0-9_-]*$", my_little_string):
        # do something here
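One caveat with the pattern above: `$` also matches just before a trailing newline, so a string such as `"abc\n"` would pass the check. A minimal sketch of a stricter variant, assuming Python 3.4 or later (where `re.fullmatch` exists); the helper name `is_valid` is hypothetical and not part of the original answer:

```python
import re

# Hypothetical helper name, used only for illustration.
_VALID = re.compile(r"[A-Za-z0-9_-]*")

def is_valid(s):
    """Return True if s contains only ASCII letters, digits, '_' and '-'."""
    # fullmatch requires the whole string to match, so no anchors are
    # needed and a trailing newline is rejected.
    return _VALID.fullmatch(s) is not None

print(is_valid("my_little-string_42"))
print(is_valid("bad string!"))
print(is_valid("abc\n"))
```

Like the original pattern, this accepts the empty string; change `*` to `+` if an empty string should be rejected.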

177

answered Sep 28 '22 06:09

Thomas

[Edit] There's another solution not mentioned yet, and it seems to outperform the others given so far in most cases.

Use string.translate to delete all valid characters from the string, and see if any invalid ones are left over. This is pretty fast because the underlying C function does the work, with very little Python bytecode involved.

Obviously performance isn't everything: going for the most readable solution is probably the best approach when not in a performance-critical code path. But just to see how the solutions stack up, here's a performance comparison of all the methods proposed so far. check_trans is the one using the string.translate method.
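The translate idea described above can be sketched for Python 3 as follows. This is an assumption-laden adaptation, not the answer's own code: in Python 3, `str.translate` accepts only a table, so the characters to delete go into the third argument of `str.maketrans`. The names `ALLOWED_CHARS`, `DELETE_ALLOWED` and `check_trans_py3` are hypothetical.

```python
import string

ALLOWED_CHARS = string.ascii_letters + string.digits + "_-"

# Three-argument str.maketrans: every character in the third argument
# is mapped to None, which means "delete" during translate().
DELETE_ALLOWED = str.maketrans("", "", ALLOWED_CHARS)

def check_trans_py3(s):
    # After deleting all allowed characters, only invalid ones remain.
    # An empty result therefore means the string was entirely valid.
    return not s.translate(DELETE_ALLOWED)

print(check_trans_py3("short_valid_string"))
print(check_trans_py3("/$%$%&"))
```

Whether this variant keeps the speed advantage reported for the original would need to be measured separately.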

Test code:

    # Python 2 code: uses print statements and string.maketrans.
    import string, re, timeit

    pat = re.compile('[\w-]*$')
    pat_inv = re.compile('[^\w-]')
    allowed_chars = string.ascii_letters + string.digits + '_-'
    allowed_set = set(allowed_chars)
    trans_table = string.maketrans('', '')  # identity table; only deletion is used

    def check_set_diff(s):
        return not set(s) - allowed_set

    def check_set_all(s):
        return all(x in allowed_set for x in s)

    def check_set_subset(s):
        return set(s).issubset(allowed_set)

    def check_re_match(s):
        return pat.match(s)

    def check_re_inverse(s):  # Search for non-matching character.
        return not pat_inv.search(s)

    def check_trans(s):
        # Delete every allowed character; anything left over is invalid.
        return not s.translate(trans_table, allowed_chars)

    test_long_almost_valid = 'a_very_long_string_that_is_mostly_valid_except_for_last_char' * 99 + '!'
    test_long_valid = 'a_very_long_string_that_is_completely_valid_' * 99
    test_short_valid = 'short_valid_string'
    test_short_invalid = '/$%$%&'
    test_long_invalid = '/$%$%&' * 99
    test_empty = ''

    def main():
        funcs = sorted(f for f in globals() if f.startswith('check_'))
        tests = sorted(f for f in globals() if f.startswith('test_'))
        for test in tests:
            print "Test %-15s (length = %d):" % (test, len(globals()[test]))
            for func in funcs:
                print "  %-20s : %.3f" % (func,
                        timeit.Timer('%s(%s)' % (func, test),
                                     'from __main__ import pat,allowed_set,%s' % ','.join(funcs + tests)).timeit(10000))
            print

    if __name__ == '__main__':
        main()

The results on my system (total seconds for 10,000 calls) are:

    Test test_empty      (length = 0):
      check_re_inverse     : 0.042
      check_re_match       : 0.030
      check_set_all        : 0.027
      check_set_diff       : 0.029
      check_set_subset     : 0.029
      check_trans          : 0.014

    Test test_long_almost_valid (length = 5941):
      check_re_inverse     : 2.690
      check_re_match       : 3.037
      check_set_all        : 18.860
      check_set_diff       : 2.905
      check_set_subset     : 2.903
      check_trans          : 0.182

    Test test_long_invalid (length = 594):
      check_re_inverse     : 0.017
      check_re_match       : 0.015
      check_set_all        : 0.044
      check_set_diff       : 0.311
      check_set_subset     : 0.308
      check_trans          : 0.034

    Test test_long_valid (length = 4356):
      check_re_inverse     : 1.890
      check_re_match       : 1.010
      check_set_all        : 14.411
      check_set_diff       : 2.101
      check_set_subset     : 2.333
      check_trans          : 0.140

    Test test_short_invalid (length = 6):
      check_re_inverse     : 0.017
      check_re_match       : 0.019
      check_set_all        : 0.044
      check_set_diff       : 0.032
      check_set_subset     : 0.037
      check_trans          : 0.015

    Test test_short_valid (length = 18):
      check_re_inverse     : 0.125
      check_re_match       : 0.066
      check_set_all        : 0.104
      check_set_diff       : 0.051
      check_set_subset     : 0.046
      check_trans          : 0.017

The translate approach seems best in most cases, dramatically so with long valid strings, but is beaten by the regexes in test_long_invalid (presumably because the regex can bail out immediately, while translate always has to scan the whole string). The set approaches are usually worst, beating both regexes only for the empty string and, in the case of set difference and issubset, the short valid string.

Using all(x in allowed_set for x in s) performs well if it bails out early, but can be slow if it has to iterate through every character. issubset and set difference are comparable, and their cost is consistently proportional to the length of the string regardless of the data.
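The early-exit behaviour of `all()` described above can be made visible with a small sketch. It assumes Python 3; `count_chars_examined` is a hypothetical helper that mimics what `all()` does and is not part of the original benchmark.

```python
import string

allowed_set = set(string.ascii_letters + string.digits + "_-")

def check_set_all(s):
    # all() stops at the first character that is not in allowed_set.
    return all(ch in allowed_set for ch in s)

def check_set_subset(s):
    # set(s) has to consume the whole string before issubset() runs.
    return set(s).issubset(allowed_set)

def count_chars_examined(s):
    """Hypothetical helper: characters inspected before all() can decide."""
    examined = 0
    for ch in s:
        examined += 1
        if ch not in allowed_set:
            break
    return examined

early = "!" + "a" * 1000   # invalid character first
late = "a" * 1000 + "!"    # invalid character last

print(count_chars_examined(early))  # 1
print(count_chars_examined(late))   # 1001
```

Both strings are invalid, but only the first lets the generator version stop immediately; the set-building versions walk all 1001 characters in either case.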

There's a similar difference between the two regex methods: matching all valid characters versus searching for an invalid character. Matching performs a little better on a long, fully valid string, but worse when the invalid characters are near the end of the string.
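The two regex strategies can be sketched for Python 3 as below. Two choices here are assumptions that differ from the benchmark code: `re.ASCII` is passed because `\w` otherwise also matches non-ASCII letters and digits in Python 3 string patterns, and `\Z` replaces `$` so that a trailing newline is rejected by the matching variant as well.

```python
import re

# re.ASCII restricts \w to [A-Za-z0-9_].
pat = re.compile(r"[\w-]*\Z", re.ASCII)    # match: every character must be valid
pat_inv = re.compile(r"[^\w-]", re.ASCII)  # search: find the first invalid character

def check_re_match(s):
    return pat.match(s) is not None

def check_re_inverse(s):
    return pat_inv.search(s) is None

# Both strategies should agree on every input.
for sample in ["short_valid_string", "/$%$%&", "", "caf\u00e9", "abc\n"]:
    assert check_re_match(sample) == check_re_inverse(sample)
```

The inverse search can stop at the first bad character, while the match must consume the valid prefix before it can fail, which is consistent with the timing difference described above.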

answered Sep 28 '22 04:09

9 revs


Tags: python, string, regex

Ethan Post
