In Python, when you want to obtain the index of the first occurrence of a substring or character within a string, you use something like this:
s.find("f")
However, I'd like to find the index of the first character within the string that does not match. Currently, I'm using the following:
iNum = 0
for i, c in enumerate(line):
    if c != mark:
        iNum = i
        break
Is there a more efficient way to do this, such as a built-in function I don't know about?
The strspn() function (from the C standard library) returns the length of the initial substring of string1 that consists entirely of characters from string2. This value is equal to the index of the first character in string1 that does not appear in string2.
I had this same problem and looked into timing the solutions here (except the map/list-comp ones from @wwii which are significantly slower than any other options). I also added in a Cython version of the original version.
I made and tested these all in Python v2.7. I was using byte-strings (instead of Unicode strings). I am unsure if the regular-expression methods need something different to work with byte-strings in Python v3. The 'mark' is hard-coded to being the null byte. This could be easily changed.
All methods return -1 if the entire byte-string is the null-byte. All of these were tested in IPython (lines starting with % are special).
import re

def f1(s): # original version
    for i, c in enumerate(s):
        if c != b'\0': return i
    return -1

def f2(s): # @ChristopherMahan's version
    i = 0
    for c in s:
        if c != b'\0': return i
        i += 1
    return -1

def f3(s): # @AndrewClark's alternate version
    # modified to use optional default argument instead of catching StopIteration
    return next((i for i, c in enumerate(s) if c != b'\0'), -1)

def f4(s): # @AndrewClark's version
    match = re.search(br'[^\0]', s)
    return match.start() if match else -1

_re = re.compile(br'[^\0]')
def f5(s): # @AndrewClark's version w/ precompiled regular expression
    match = _re.search(s)
    return match.start() if match else -1
%load_ext cythonmagic

%%cython
# original version optimized in Cython
import cython
@cython.boundscheck(False)
@cython.wraparound(False)
def f6(bytes s):
    cdef Py_ssize_t i
    for i in xrange(len(s)):
        if s[i] != b'\0': return i
    return -1
The timing results:
s = (b'\x00' * 32) + (b'\x01' * 32) # test string
In [11]: %timeit f1(s) # original version
100000 loops, best of 3: 2.48 µs per loop
In [12]: %timeit f2(s) # @ChristopherMahan's version
100000 loops, best of 3: 2.35 µs per loop
In [13]: %timeit f3(s) # @AndrewClark's alternate version
100000 loops, best of 3: 3.07 µs per loop
In [14]: %timeit f4(s) # @AndrewClark's version
1000000 loops, best of 3: 1.91 µs per loop
In [15]: %timeit f5(s) # @AndrewClark's version w/ precompiled regular expression
1000000 loops, best of 3: 845 ns per loop
In [16]: %timeit f6(s) # original version optimized in Cython
1000000 loops, best of 3: 305 ns per loop
Overall, @ChristopherMahan's version is slightly faster than the original (apparently enumerate is slower than using your own counter). Using next (@AndrewClark's alternate version) is slower than the original, even though it is essentially the same thing in one-line form.
Using regular expressions (@AndrewClark's version) is significantly faster than a loop, especially if you precompile the regex!
Then, if you can use Cython, it is by far the fastest. The OP's concern that using a regex is slow is validated, but a loop in Python is even slower. The loop in Cython is quite fast.
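The functions above were written for Python 2.7. Here is a minimal sketch of how the loop and the precompiled regex might look under Python 3, where iterating over a bytes object yields integers rather than one-byte strings, so the comparison is against 0 instead of b'\0'. The function names and the timeit call are illustrative assumptions and not part of the original benchmark; no timings are claimed here.

```python
import re
import timeit

MARK = 0  # assumption: the mark is the null byte, as in the functions above


def first_non_mark_loop(s: bytes) -> int:
    # In Python 3, iterating over bytes yields ints, so compare against an int.
    for i, c in enumerate(s):
        if c != MARK:
            return i
    return -1


# Precompiled bytes pattern: any byte other than the null byte.
_non_null = re.compile(br'[^\0]')


def first_non_mark_regex(s: bytes) -> int:
    match = _non_null.search(s)
    return match.start() if match else -1


s = (b'\x00' * 32) + (b'\x01' * 32)  # same test string as above
for func in (first_non_mark_loop, first_non_mark_regex):
    # Total seconds for 10000 calls; results will vary by machine.
    seconds = timeit.timeit(lambda: func(s), number=10000)
    print(func.__name__, func(s), seconds)
```

Outside IPython, timeit.timeit is the closest standard-library equivalent of %timeit; it reports the total time for all calls rather than a per-loop best.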
You can use regular expressions, for example:
>>> import re
>>> re.search(r'[^f]', 'ffffooooooooo').start()
4
[^f] will match any character except for f, and the start() method of a Match object (returned by re.search()) will give the index at which the match occurred.
To make sure you can also handle empty strings or strings that only contain f, you would want to check that the result of re.search() is not None, which is what it returns if the regex cannot be matched. For example:
first_index = -1
match = re.search(r'[^f]', line)
if match:
    first_index = match.start()
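The snippet above hard-codes f. If the mark is held in a variable, as in the question, one way to generalize it is sketched below. The helper name first_non_matching_index is made up for illustration, and the sketch assumes mark is a single, non-empty character.

```python
import re


def first_non_matching_index(line, mark):
    """Return the index of the first character in line that is not mark, or -1."""
    # re.escape keeps characters such as ']', '^' or '\' literal inside the class.
    pattern = '[^' + re.escape(mark) + ']'
    match = re.search(pattern, line)
    return match.start() if match else -1


print(first_non_matching_index('ffffooooooooo', 'f'))  # 4
print(first_non_matching_index('ffff', 'f'))           # -1
print(first_non_matching_index('', 'f'))               # -1
```

Building the pattern on every call means the regex is looked up or recompiled each time; if the mark is fixed, compiling the pattern once with re.compile() avoids that.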
If you prefer not to use regex, you won't do any better than your current method. You could use something like next(i for i, c in enumerate(line) if c != mark), but you would need to wrap this in a try and except StopIteration block to handle empty lines or lines that consist of only mark characters.
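A minimal sketch of that next() approach, wrapped so that empty lines and lines made up only of mark characters return -1. The function names and the choice of -1 as the fallback value are assumptions for illustration.

```python
def first_index_try(line, mark):
    # next() raises StopIteration when the generator yields nothing,
    # i.e. when line is empty or contains only mark characters.
    try:
        return next(i for i, c in enumerate(line) if c != mark)
    except StopIteration:
        return -1


def first_index_default(line, mark):
    # Same idea, using next()'s optional default instead of try/except.
    return next((i for i, c in enumerate(line) if c != mark), -1)


print(first_index_try('ffffooooooooo', 'f'))      # 4
print(first_index_default('ffffooooooooo', 'f'))  # 4
print(first_index_try('ffff', 'f'))               # -1
print(first_index_default('', 'f'))               # -1
```

The second form is the same technique as the f3 function shown earlier, which passes -1 as the default to next() rather than catching the exception.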