I have a simple task I need to perform in Python: convert a string to all lowercase and strip out every character that is not an ASCII letter.
For example:

```text
"This is a Test" -> "thisisatest"
"A235th@#$&( er Ra{}|?>ndom" -> "atherrandom"
```
I have a simple function to do this:
```python
import string
import sys

def strip_string_to_lowercase(s):
    tmpStr = s.lower().strip()
    retStrList = []
    for x in tmpStr:
        if x in string.ascii_lowercase:
            retStrList.append(x)
    return ''.join(retStrList)
```
But I cannot help thinking there is a more efficient, or more elegant, way.
Thanks!
Edit:
Thanks to all those who answered. I learned, and in some cases re-learned, a good deal of Python.
The Python lower() method returns a copy of a string with all cased characters converted to lowercase. Numbers and special characters are left unchanged. It is called on a string value, as in s.lower(), and takes no parameters.
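As a quick illustration of the behaviour described above, here is a minimal sketch that applies lower() to the second sample string from the question. The variable names `sample` and `lowered` are only illustrative.

```python
# str.lower() returns a new string; the original is not modified.
sample = "A235th@#$&( er Ra{}|?>ndom"
lowered = sample.lower()

# Digits, spaces and punctuation pass through unchanged;
# only the letters A and R are converted.
print(lowered)  # a235th@#$&( er ra{}|?>ndom
```

Note that lower() alone does not remove anything, so a separate filtering step is still needed to reach "atherrandom".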
For comparison, JavaScript's equivalent is the toLowerCase() method, which converts a string to lowercase letters. Note: the toUpperCase() method converts a string to uppercase letters.
To strip non-printable characters from a string in Python, call the isprintable() method on each character inside a list comprehension to check whether each character in my_string is printable. Then call ''.join() on the resulting list to join the printable characters back into a string.
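The isprintable approach described above might look like the following sketch. The helper name `strip_non_printable` and the sample value of `my_string` are assumptions made for illustration.

```python
def strip_non_printable(text):
    """Keep only the characters for which str.isprintable() is true."""
    return "".join([ch for ch in text if ch.isprintable()])


# Hypothetical input containing a NUL byte, a tab and a newline.
my_string = "abc\x00def\tghi\n"
cleaned = strip_non_printable(my_string)
print(cleaned)  # abcdefghi
```

This only drops control characters and similar non-printable code points. Digits, punctuation and spaces are printable, so it does not by itself solve the letters-only task in the question.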
Another solution (not that Pythonic, but very fast) is to use string.translate, though note that this will not work for unicode strings. It's also worth noting that you can speed up Dana's code by moving the characters into a set (which looks up by hash, rather than performing a linear search each time). Here are the timings I get for several of the solutions given:
```python
import string, re, timeit

# Precomputed values (for str_join_set and translate)
letter_set = frozenset(string.ascii_lowercase + string.ascii_uppercase)
tab = string.maketrans(string.ascii_lowercase + string.ascii_uppercase,
                       string.ascii_lowercase * 2)
deletions = ''.join(ch for ch in map(chr,range(256)) if ch not in letter_set)

s="A235th@#$&( er Ra{}|?>ndom"

# From unwind's filter approach
def test_filter(s):
    return filter(lambda x: x in string.ascii_lowercase, s.lower())

# using set instead (and contains)
def test_filter_set(s):
    return filter(letter_set.__contains__, s).lower()

# Tomalak's solution
def test_regex(s):
    return re.sub('[^a-z]', '', s.lower())

# Dana's
def test_str_join(s):
    return ''.join(c for c in s.lower() if c in string.ascii_lowercase)

# Modified to use a set.
def test_str_join_set(s):
    return ''.join(c for c in s.lower() if c in letter_set)

# Translate approach.
def test_translate(s):
    return string.translate(s, tab, deletions)

for test in sorted(globals()):
    if test.startswith("test_"):
        assert globals()[test](s)=='atherrandom'
        print "%30s : %s" % (test, timeit.Timer("f(s)", "from __main__ import %s as f, s" % test).timeit(200000))
```
This gives me:
```text
                   test_filter : 2.57138351271
               test_filter_set : 0.981806765698
                    test_regex : 3.10069885233
                 test_str_join : 2.87172979743
             test_str_join_set : 2.43197956381
                test_translate : 0.335367566218
```
[Edit] Updated with filter solutions as well. (Note that using set.__contains__ makes a big difference here, as it avoids making an extra function call for the lambda.)
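The benchmark above is Python 2 code (print statement, string.maketrans, and filter returning a str), so it will not run unchanged on Python 3. Below is a rough Python 3 sketch of the set-based join and the translate approach. It is not part of the original answer and was not timed. The function names are made up for illustration, and the translate variant assumes it is acceptable to drop non-ASCII characters by encoding with errors="ignore".

```python
import string

# Set of ASCII letters for hash-based membership tests.
LETTERS = frozenset(string.ascii_letters)

# Translation table mapping A-Z to a-z, plus every byte value to delete.
_TABLE = bytes.maketrans(string.ascii_uppercase.encode("ascii"),
                         string.ascii_lowercase.encode("ascii"))
_DELETE = bytes(b for b in range(256) if chr(b) not in LETTERS)


def strip_with_join(s):
    """Keep ASCII letters only (set lookup), then lowercase the result."""
    return "".join(c for c in s if c in LETTERS).lower()


def strip_with_translate(s):
    """Same result via bytes.translate.

    Assumption: non-ASCII characters should simply be discarded,
    which encode(..., "ignore") does before the translation step.
    """
    return s.encode("ascii", "ignore").translate(_TABLE, _DELETE).decode("ascii")


print(strip_with_join("A235th@#$&( er Ra{}|?>ndom"))       # atherrandom
print(strip_with_translate("A235th@#$&( er Ra{}|?>ndom"))  # atherrandom
```

Working on bytes keeps the table-plus-deletions shape of the Python 2 version. Relative speeds on Python 3 may differ from the timings above, so it is worth re-measuring with timeit before choosing one.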