Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

In Python, how to check if a string only contains certain characters?

In Python, how to check if a string only contains certain characters?

I need to check a string containing only a..z, 0..9, and . (period) and no other character.

I could iterate over each character and check the character is a..z or 0..9, or . but that would be slow.

I am not clear now how to do it with a regular expression.

Is this correct? Can you suggest a simpler regular expression or a more efficient approach.

#Valid chars . a-z 0-9 
def check(test_str):
    import re
    #http://docs.python.org/library/re.html
    #re.search returns None if no position in the string matches the pattern
    #pattern to search for any character other then . a-z 0-9
    pattern = r'[^\.a-z0-9]'
    if re.search(pattern, test_str):
        #Character other then . a-z 0-9 was found
        print 'Invalid : %r' % (test_str,)
    else:
        #No character other then . a-z 0-9 was found
        print 'Valid   : %r' % (test_str,)

check(test_str='abcde.1')
check(test_str='abcde.1#')
check(test_str='ABCDE.12')
check(test_str='_-/>"!@#12345abcde<')

'''
Output:
>>> 
Valid   : "abcde.1"
Invalid : "abcde.1#"
Invalid : "ABCDE.12"
Invalid : "_-/>"!@#12345abcde<"
'''
like image 556
X10 Avatar asked Aug 24 '09 16:08

X10


People also ask

How do I check if a string contains a specific character?

Use the String. includes() method to check if a string contains a character, e.g. if (str. includes(char)) {} . The include() method will return true if the string contains the provided character, otherwise false is returned.

How do you check if a string has a certain substring Python?

Using the 'in' operator: The in operator is the easiest and pythonic way to check if a python string contains a substring. The in and not in are membership operators, they take in two arguments and evaluate if one is a member of the other. They return a boolean value.

How do you check if a string contains a specific substring?

You can use contains(), indexOf() and lastIndexOf() method to check if one String contains another String in Java or not. If a String contains another String then it's known as a substring. The indexOf() method accepts a String and returns the starting position of the string if it exists, otherwise, it will return -1.

How do you check if certain letters are in a list Python?

To check if the list contains an element in Python, use the “in” operator. The “in” operator checks if the list contains a specific item or not. It can also check if the element exists on the list or not using the list. count() function.


4 Answers

Here's a simple, pure-Python implementation. It should be used when performance is not critical (included for future Googlers).

import string
allowed = set(string.ascii_lowercase + string.digits + '.')

def check(test_str):
    set(test_str) <= allowed

Regarding performance, iteration will probably be the fastest method. Regexes have to iterate through a state machine, and the set equality solution has to build a temporary set. However, the difference is unlikely to matter much. If performance of this function is very important, write it as a C extension module with a switch statement (which will be compiled to a jump table).

Here's a C implementation, which uses if statements due to space constraints. If you absolutely need the tiny bit of extra speed, write out the switch-case. In my tests, it performs very well (2 seconds vs 9 seconds in benchmarks against the regex).

#define PY_SSIZE_T_CLEAN
#include <Python.h>

static PyObject *check(PyObject *self, PyObject *args)
{
        const char *s;
        Py_ssize_t count, ii;
        char c;
        if (0 == PyArg_ParseTuple (args, "s#", &s, &count)) {
                return NULL;
        }
        for (ii = 0; ii < count; ii++) {
                c = s[ii];
                if ((c < '0' && c != '.') || c > 'z') {
                        Py_RETURN_FALSE;
                }
                if (c > '9' && c < 'a') {
                        Py_RETURN_FALSE;
                }
        }

        Py_RETURN_TRUE;
}

PyDoc_STRVAR (DOC, "Fast stringcheck");
static PyMethodDef PROCEDURES[] = {
        {"check", (PyCFunction) (check), METH_VARARGS, NULL},
        {NULL, NULL}
};
PyMODINIT_FUNC
initstringcheck (void) {
        Py_InitModule3 ("stringcheck", PROCEDURES, DOC);
}

Include it in your setup.py:

from distutils.core import setup, Extension
ext_modules = [
    Extension ('stringcheck', ['stringcheck.c']),
],

Use as:

>>> from stringcheck import check
>>> check("abc")
True
>>> check("ABC")
False
like image 192
John Millikin Avatar answered Oct 19 '22 14:10

John Millikin


Final(?) edit

Answer, wrapped up in a function, with annotated interactive session:

>>> import re
>>> def special_match(strg, search=re.compile(r'[^a-z0-9.]').search):
...     return not bool(search(strg))
...
>>> special_match("")
True
>>> special_match("az09.")
True
>>> special_match("az09.\n")
False
# The above test case is to catch out any attempt to use re.match()
# with a `$` instead of `\Z` -- see point (6) below.
>>> special_match("az09.#")
False
>>> special_match("az09.X")
False
>>>

Note: There is a comparison with using re.match() further down in this answer. Further timings show that match() would win with much longer strings; match() seems to have a much larger overhead than search() when the final answer is True; this is puzzling (perhaps it's the cost of returning a MatchObject instead of None) and may warrant further rummaging.

==== Earlier text ====

The [previously] accepted answer could use a few improvements:

(1) Presentation gives the appearance of being the result of an interactive Python session:

reg=re.compile('^[a-z0-9\.]+$')
>>>reg.match('jsdlfjdsf12324..3432jsdflsdf')
True

but match() doesn't return True

(2) For use with match(), the ^ at the start of the pattern is redundant, and appears to be slightly slower than the same pattern without the ^

(3) Should foster the use of raw string automatically unthinkingly for any re pattern

(4) The backslash in front of the dot/period is redundant

(5) Slower than the OP's code!

prompt>rem OP's version -- NOTE: OP used raw string!

prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import
re;reg=re.compile(r'[^a-z0-9\.]')" "not bool(reg.search(t))"
1000000 loops, best of 3: 1.43 usec per loop

prompt>rem OP's version w/o backslash

prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import
re;reg=re.compile(r'[^a-z0-9.]')" "not bool(reg.search(t))"
1000000 loops, best of 3: 1.44 usec per loop

prompt>rem cleaned-up version of accepted answer

prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import
re;reg=re.compile(r'[a-z0-9.]+\Z')" "bool(reg.match(t))"
100000 loops, best of 3: 2.07 usec per loop

prompt>rem accepted answer

prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import
re;reg=re.compile('^[a-z0-9\.]+$')" "bool(reg.match(t))"
100000 loops, best of 3: 2.08 usec per loop

(6) Can produce the wrong answer!!

>>> import re
>>> bool(re.compile('^[a-z0-9\.]+$').match('1234\n'))
True # uh-oh
>>> bool(re.compile('^[a-z0-9\.]+\Z').match('1234\n'))
False
like image 54
John Machin Avatar answered Oct 19 '22 16:10

John Machin


Simpler approach? A little more Pythonic?

>>> ok = "0123456789abcdef"
>>> all(c in ok for c in "123456abc")
True
>>> all(c in ok for c in "hello world")
False

It certainly isn't the most efficient, but it's sure readable.

like image 47
Mark Rushakoff Avatar answered Oct 19 '22 14:10

Mark Rushakoff


EDIT: Changed the regular expression to exclude A-Z

Regular expression solution is the fastest pure python solution so far

reg=re.compile('^[a-z0-9\.]+$')
>>>reg.match('jsdlfjdsf12324..3432jsdflsdf')
True
>>> timeit.Timer("reg.match('jsdlfjdsf12324..3432jsdflsdf')", "import re; reg=re.compile('^[a-z0-9\.]+$')").timeit()
0.70509696006774902

Compared to other solutions:

>>> timeit.Timer("set('jsdlfjdsf12324..3432jsdflsdf') <= allowed", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
3.2119350433349609
>>> timeit.Timer("all(c in allowed for c in 'jsdlfjdsf12324..3432jsdflsdf')", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
6.7066690921783447

If you want to allow empty strings then change it to:

reg=re.compile('^[a-z0-9\.]*$')
>>>reg.match('')
False

Under request I'm going to return the other part of the answer. But please note that the following accept A-Z range.

You can use isalnum

test_str.replace('.', '').isalnum()

>>> 'test123.3'.replace('.', '').isalnum()
True
>>> 'test123-3'.replace('.', '').isalnum()
False

EDIT Using isalnum is much more efficient than the set solution

>>> timeit.Timer("'jsdlfjdsf12324..3432jsdflsdf'.replace('.', '').isalnum()").timeit()
0.63245487213134766

EDIT2 John gave an example where the above doesn't work. I changed the solution to overcome this special case by using encode

test_str.replace('.', '').encode('ascii', 'replace').isalnum()

And it is still almost 3 times faster than the set solution

timeit.Timer("u'ABC\u0131\u0661'.encode('ascii', 'replace').replace('.','').isalnum()", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
1.5719811916351318

In my opinion using regular expressions is the best to solve this problem

like image 16
Nadia Alramli Avatar answered Oct 19 '22 15:10

Nadia Alramli