Given the target ('b', 'a')
and the inputs:
x0 = ('b', 'a', 'z', 'z')
x1 = ('b', 'a', 'z', 'z')
x2 = ('z', 'z', 'a', 'a')
x3 = ('z', 'b', 'a', 'a')
The aim is to find the location of the contiguous ('b', 'a')
pair and get the output:
>>> find_ba(x0)
0
>>> find_ba(x1)
0
>>> find_ba(x2)
None
>>> find_ba(x3)
1
Using the pairwise recipe:
from itertools import tee

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)
I could do this to get the desired output:
def find_ba(x, target=('b', 'a')):
    try:
        return next(i for i, pair in enumerate(pairwise(x)) if pair == target)
    except StopIteration:
        return None
But that would require me to loop through all pairs of characters until I find the first instance. Is there a way to find the index of pairwise elements without looping over all the characters?
Answering @MatthiasFripp's question in the comments:
Are your elements in lists or types (as shown) or in a generator (e.g. reading from a file handle)?
The x* are all tuples of strings, so they can be accessed by index. But if the answer/solution can work for both tuples and generators, that'll be great!
Can you say about how many lists you have to search and about how long they are? That would help for suggesting a search strategy.
The lengths of the tuples are not fixed. They can be of size > 2.
The pairwise() function in the itertools module returns the successive overlapping pairs taken from the input iterable, in the order of the elements. It was added in Python 3.10.
The fastest general search algorithm has O(n) average performance (this is linear search); that means you have no alternative (except maybe for a constant factor) to processing each element.
Given your question:
Is there a way to find the index of pairwise elements without looping over all the characters?
That's possible (it's still O(n) though) by only looking at every second item:
from itertools import count

def find_ab(tup):
    # Only odd indices are inspected; a neighbour check resolves each hit.
    for idx in count(start=1, step=2):
        try:
            if tup[idx] == 'b':
                if tup[idx + 1] == 'a':
                    return idx
            elif tup[idx] == 'a':
                if tup[idx - 1] == 'b':
                    return idx - 1
        except IndexError:
            break
    return None
In the worst case it will still compare all items, but it will skip one item for every odd-indexed item that isn't 'b' or 'a'.
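A quick sanity check of find_ab against the sample inputs from the question (the function is repeated here so the snippet is self-contained):

```python
from itertools import count

def find_ab(tup):
    # Only odd indices are inspected; a neighbour check resolves each hit.
    for idx in count(start=1, step=2):
        try:
            if tup[idx] == 'b':
                if tup[idx + 1] == 'a':
                    return idx
            elif tup[idx] == 'a':
                if tup[idx - 1] == 'b':
                    return idx - 1
        except IndexError:
            break
    return None

print(find_ab(('b', 'a', 'z', 'z')))  # 0
print(find_ab(('z', 'b', 'a', 'a')))  # 1
print(find_ab(('z', 'z', 'a', 'a')))  # None
```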
That's a bit like cheating so let me explain why common alternatives aren't possible in your case:
Binary search only needs to compare log(n) items, but it requires the sequence to be sorted. Your examples aren't sorted, and sorting them would require O(n*log(n)) operations - which wouldn't just process each item once, it would process some of them several times. Not that I know of a sensible way to sort by adjacent elements anyway.
You have tuples, so creating a hashtable (a dict) doesn't make sense, because in order to create that structure you would need to process each element anyway.
But if you plan to do several of these searches for pairs, you could create the dictionary once (O(n)) and do many searches afterwards in O(1):
d = {}
for idx, pair in enumerate(pairwise(x0)):
    if pair not in d:  # keep only the first index for each pair
        d[pair] = idx

>>> d.get(('b', 'a'), None)
0
However that approach is far slower if you only want to search for one pair, because you lose the "short-circuit behaviour" (stops as soon as a match is found) and you process all elements when creating the dictionary.
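Wrapped up as a reusable helper (build_pair_index is just an illustrative name, not from the text above), the build-once/look-up-many pattern could look like this:

```python
from itertools import tee

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def build_pair_index(seq):
    # One O(n) pass; setdefault keeps only the first index per pair.
    index = {}
    for idx, pair in enumerate(pairwise(seq)):
        index.setdefault(pair, idx)
    return index

index = build_pair_index(('z', 'b', 'a', 'a'))
print(index.get(('b', 'a')))  # 1
print(index.get(('z', 'z')))  # None (that pair never occurs)
```

After the single build, every further pair lookup is a plain dict access.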
Besides the general approaches:
- O(n) linear search
- O(log(n)) binary search (for sorted data)
- O(1) lookups (for hashable lookups or other search problems that just need to search in some "bucket"s)
you can generally exploit any structure or knowledge about your data to reduce the amount of items you need to process. The problem is mostly that there are (probably) no already existing data structures for these, and homemade implementations often end up orders of magnitude slower than naive "process all elements" approaches. But if you have any meta-information about your sequences, then you can exploit it.
The recipe for pairwise is actually quite nice, but you could also use iteration_utilities.successive [1]. Last I checked it was roughly 1.5 to 2 times faster than the recipe. Even if you don't change the approach and accept that you need to process all (or almost all) elements in the worst case, it may be faster!
That data was probably generated. Maybe it's worthwhile to actually "search" for the elements during creation. That way you don't need the extra pass over the data at all. Or you could create the dict while creating the dataset (that allows for the O(1) lookups afterwards). Sometimes it's a good idea to look at the process that generated/downloaded/fetched the dataset to see if there is some way the information can be extracted there.
Now, after writing all this text I need to state the obvious:
Your approach is really good. Even if it needs to process all elements in the worst case, it uses a perfect fit (the pairwise recipe) for the problem at hand, and it should actually work really fast even for long inputs. For a tuple containing 1 million 'z' it only needs 200ms on my computer. So you can process several million elements per second (even on old and slow computers like mine). That's probably not fast enough for big data, but then pure Python isn't a good language for processing big data (typically you would need to write a C extension, use Cython, or some NumPy, Pandas or derivative approach). Also, the next function on a generator is lazy (assuming you use itertools.izip on Python 2 instead of zip), so you only process each tuple until you find a match.
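To illustrate that short-circuiting on a generator input (the consumed list below is instrumentation added for this demo, not part of the original code):

```python
from itertools import tee

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def find_ba(x, target=('b', 'a')):
    try:
        return next(i for i, pair in enumerate(pairwise(x)) if pair == target)
    except StopIteration:
        return None

consumed = []

def chars():
    # Record every element the search actually pulls from the generator.
    for c in ('z', 'b', 'a', 'z', 'z'):
        consumed.append(c)
        yield c

print(find_ba(chars()))  # 1
print(consumed)          # ['z', 'b', 'a'] - the trailing 'z's were never read
```

The search stops as soon as the ('b', 'a') pair appears, so the rest of the generator is never consumed.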
Personally, I would simply use your original approach. Or if I had to find several pairs then I would just create the dictionary I mentioned earlier (maybe even serialize it) and do the lookups therein.
The bounty reason explicitly asks for "credible and/or official sources". Fortunately, "search algorithms" are well studied, so you can find explanations for each of the mentioned approaches in basic textbooks on algorithms. For example:
Sedgewick & Wayne - Algorithms
Wikipedia: "Linear search"
Wikipedia: "Hash table" (the structure behind Python's dict).
There's also a small overview of time complexities of Python types in the Python wiki: "TimeComplexity". For lookups you have to check "Get Item" or "in".
[1] Disclosure: I'm the author of that 3rd party library.
Not very impressive, but it works for your case - check it out.
We simply extract the index of each matched item in the sample and check whether the indices are consecutive.
def consecutive_index(src, sample):
    result = None
    il = [src.index(a) for a in sample if a in src]
    if len(il) == len(sample) and len(range(il[0], il[-1])) == 1:
        result = il[0]
    return result
x0 = ('b', 'a', 'z', 'z')
x1 = ('b', 'a', 'z', 'z')
x2 = ('z', 'z', 'a', 'a')
x3 = ('z', 'b', 'a', 'a')
sample = ('b', 'a')
## TEST your given combinations.
print(consecutive_index(x0, sample))  # expected 0
print(consecutive_index(x1, sample))  # expected 0
print(consecutive_index(x2, sample))  # expected None
print(consecutive_index(x3, sample))  # expected 1