Python: Find first non-matching character

Tags:

python

string

In Python, when you want to obtain the index of the first occurrence of a substring or character within a string, you use something like this:

s.find("f")

However, I'd like to find the index of the first character within the string that does not match a given character (mark). Currently, I'm using the following:

iNum = 0
for i, c in enumerate(line):
    if c != mark:
        iNum = i
        break

Is there a more efficient way to do this, such as a built-in function I don't know about?


asked Oct 04 '13 21:10

2 Answers

I had this same problem and timed the solutions here (except the map/list-comp ones from @wwii, which are significantly slower than any of the other options). I also added a Cython version of the original loop.

I wrote and tested all of these in Python 2.7, using byte strings rather than Unicode strings. I am unsure whether the regular-expression methods need anything different to work with byte strings in Python 3. The 'mark' is hard-coded as the null byte, which is easy to change.

All methods return -1 if the byte string is empty or consists entirely of null bytes. All of these were tested in IPython (lines starting with % are IPython magic commands).

import re

def f1(s): # original version
    for i, c in enumerate(s):
        if c != b'\0': return i
    return -1

def f2(s): # @ChristopherMahan's version
    i = 0
    for c in s:
        if c != b'\0': return i
        i += 1
    return -1

def f3(s): # @AndrewClark's alternate version
    # modified to use optional default argument instead of catching StopIteration
    return next((i for i, c in enumerate(s) if c != b'\0'), -1)

def f4(s): # @AndrewClark's version
    match = re.search(br'[^\0]', s)
    return match.start() if match else -1

_re = re.compile(br'[^\0]')
def f5(s): # @AndrewClark's version w/ precompiled regular expression
    match = _re.search(s)
    return match.start() if match else -1

%load_ext cythonmagic
%%cython
# original version optimized in Cython
import cython
@cython.boundscheck(False)
@cython.wraparound(False)
def f6(bytes s):
    cdef Py_ssize_t i
    for i in xrange(len(s)):
        if s[i] != b'\0': return i
    return -1

The timing results:

s = (b'\x00' * 32) + (b'\x01' * 32) # test string

In [11]: %timeit f1(s) # original version
100000 loops, best of 3: 2.48 µs per loop

In [12]: %timeit f2(s) # @ChristopherMahan's version
100000 loops, best of 3: 2.35 µs per loop

In [13]: %timeit f3(s) # @AndrewClark's alternate version
100000 loops, best of 3: 3.07 µs per loop

In [14]: %timeit f4(s) # @AndrewClark's version
1000000 loops, best of 3: 1.91 µs per loop

In [15]: %timeit f5(s) # @AndrewClark's version w/ precompiled regular expression
1000000 loops, best of 3: 845 ns per loop

In [16]: %timeit f6(s) # original version optimized in Cython
1000000 loops, best of 3: 305 ns per loop
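Outside IPython, similar measurements can be taken with the standard-library timeit module. The following is only a sketch: best_time is a hypothetical helper name, and len is a placeholder for whichever of f1 to f6 you want to measure. The lambda wrapper adds some call overhead, so absolute numbers will not match %timeit exactly.

```python
import timeit

def best_time(func, data, number=100000, repeat=3):
    """Return the best per-call time in seconds over `repeat` runs of `number` calls."""
    totals = timeit.repeat(lambda: func(data), repeat=repeat, number=number)
    return min(totals) / number

s = (b'\x00' * 32) + (b'\x01' * 32)  # same test string as above

# len is a placeholder; substitute the function under test
print('%.0f ns per call' % (best_time(len, s) * 1e9))
```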

Overall, @ChristopherMahan's version is slightly faster than the original (apparently enumerate is slower than maintaining your own counter). Using next() (@AndrewClark's alternate version) is slower than the original, even though it is essentially the same thing in one-line form.

Using a regular expression (@AndrewClark's version) is significantly faster than a loop, especially if you precompile the regex!

If you can use Cython, it is by far the fastest option. The OP's concern that a regex is slow is borne out relative to the Cython loop (845 ns versus 305 ns), but a loop in pure Python is slower still (2.35 µs at best).
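For readers on Python 3, here is a hedged sketch of how the loop (f1) and the precompiled regex (f5) might be ported. In Python 3, iterating over a bytes object yields integers rather than length-1 byte strings, so a comparison such as c != b'\0' is always true and the loop versions above would return 0 for any non-empty input. The function names below are illustrative and not part of the original benchmark, and these versions were not timed.

```python
import re

_NONNULL = re.compile(br'[^\0]')  # same pattern as f5

def first_nonnull_loop(s):
    # In Python 3 each item of a bytes object is an int, so compare with 0
    for i, c in enumerate(s):
        if c != 0:
            return i
    return -1

def first_nonnull_regex(s):
    # A bytes pattern searches a bytes object without further changes
    match = _NONNULL.search(s)
    return match.start() if match else -1

s = (b'\x00' * 32) + (b'\x01' * 32)
print(first_nonnull_loop(s), first_nonnull_regex(s))  # prints: 32 32
```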


answered Oct 05 '22 17:10

coderforlife

You can use regular expressions, for example:

>>> import re
>>> re.search(r'[^f]', 'ffffooooooooo').start()
4

[^f] will match any character except f, and the start() method of a Match object (returned by re.search()) will give the index at which the match occurred.

To handle empty strings or strings that contain only f, check that the result of re.search() is not None, which is what it returns when the regex cannot be matched. For example:

first_index = -1
match = re.search(r'[^f]', line)
if match:
    first_index = match.start()
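The check above can be wrapped in a small helper that accepts any single-character mark. This is a sketch only: first_mismatch is a hypothetical name, and it assumes mark is exactly one character (a longer mark would be treated as a set of characters inside the brackets).

```python
import re

def first_mismatch(line, mark):
    """Return the index of the first character in line that differs from mark, or -1."""
    # re.escape keeps characters such as ']' or '^' literal inside the character class
    pattern = re.compile('[^' + re.escape(mark) + ']')
    match = pattern.search(line)
    return match.start() if match else -1

print(first_mismatch('ffffooooooooo', 'f'))  # prints: 4
print(first_mismatch('ffff', 'f'))           # prints: -1
```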

If you prefer not to use regex, you won't do any better than your current method. You could use something like next(i for i, c in enumerate(line) if c != mark), but you would need to wrap this with a try and except StopIteration block to handle empty lines or lines that consist of only mark characters.
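As a sketch of the non-regex route, the first function below wraps next() in the try/except described above, and the second uses the optional default argument of next() to the same effect. The third is an alternative not taken from this thread: it counts how many leading mark characters str.lstrip removes, and assumes mark is a single character. All three function names are hypothetical.

```python
def first_mismatch_try(line, mark):
    try:
        return next(i for i, c in enumerate(line) if c != mark)
    except StopIteration:
        # line is empty or consists only of mark characters
        return -1

def first_mismatch_default(line, mark):
    # next() returns its second argument when the generator is exhausted
    return next((i for i, c in enumerate(line) if c != mark), -1)

def first_mismatch_lstrip(line, mark):
    # Alternative not from this thread; assumes mark is a single character
    stripped = len(line) - len(line.lstrip(mark))
    return stripped if stripped < len(line) else -1

print(first_mismatch_try('ffffooooooooo', 'f'))  # prints: 4
```

None of these were part of the timings reported in the other answer, so no claim is made here about their relative speed.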

answered Oct 05 '22 19:10

Andrew Clark

Question asked by Zauber Paracelsus