Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding index of pairwise elements

Given the target ('b', 'a') and the inputs:

x0 = ('b', 'a', 'z', 'z')
x1 = ('b', 'a', 'z', 'z')
x2 = ('z', 'z', 'a', 'a')
x3 = ('z', 'b', 'a', 'a')

The aim is to find the location of the continuous ('b', 'a') element and get the output:

>>> find_ba(x0)
0
>>> find_ba(x1)
0
>>> find_ba(x2)
None
>>> find_ba(x3)
1

Using the pairwise recipe:

from itertools import tee
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

I could do this to get the desired output:

def find_ba(x, target=('b', 'a')):
    try:
        return next(i for i, pair in enumerate(pairwise(x)) if pair == target)
    except StopIteration:
        return None

But that would require me to loop through all pairs of characters till I find the first instance. Is there a way to find index of pairwise elements without looping all the characters?


Answering @MatthiasFripp's question in the comments:

Are your elements in lists or types (as shown) or in a generator (e.g. reading from a file handle)?

The x* are all tuples of strings. So they can be access through the index. But if the answer/solution can work for tuples and generator, that'll be great!

Can you say about how many lists you have to search and about how long they are? That would help for suggesting a search strategy.

The lengths of the tuples are not fixed. They can be of size > 2.

like image 203
alvas Avatar asked Apr 26 '17 09:04

alvas


People also ask

How do you find the index of an element?

indexOf() The indexOf() method returns the first index at which a given element can be found in the array, or -1 if it is not present.

What is pairwise Python?

The pairwise() method in the itertools module returns the successive overlapping pairs taken from the input iterable. The sequence of the output pairs is in the order of the elements in the iterable. This functionality was introduced in Python version 3.10.


Video Answer


2 Answers

The fastest general search algorithm will have O(n) average performance (called linear search), that means you have no alternative (except maybe for a constant factor) than to process each element.

Given your question:

Is there a way to finding index of pairwise elements without looping all the characters?

That's possible (it's still O(n) though) by only looking at each second item:

from itertools import count

def find_ab(tup):
    for idx in count(start=1, step=2):
        try:
            if tup[idx] == 'b':
                if tup[idx+1] == 'a':
                    return idx
            elif tup[idx] == 'a':
                if tup[idx-1] == 'b':
                    return idx-1
        except IndexError:
            break

In the worst case it will still compare all items but it will skip one item for every odd-indexed item that isn't 'b' or 'a'.

That's a bit like cheating so let me explain why common alternatives aren't possible in your case:

Binary search

Binary search only needs to compare log(n) items but it requires the sequence to be sorted. Your examples aren't sorted so sorting them would require O(n*log(n)) operations - which wouldn't just process each item once, it would process some of them several times. Not that I know a sensible way to sort adjacent elements anyway.

Bucket search (or hashtables)

You have tuples so creating a hashtable (a dict) doesn't make sense because in order to create that structure you would need to process each element.

But if you plan to do several of these searches for pairs you could create the dictionary once (O(n)) and do many searches afterwards in O(1):

d = {}
for idx, pair in enumerate(pairwise(x0)):
    if pair not in d:    # keep only the first index for each pair
        d[pair] = idx

>>> d.get(('b', 'a'), None)
0

However that approach is far slower if you only want to search for one pair, because you lose the "short-circuit behaviour" (stops as soon as a match is found) and you process all elements when creating the dictionary.

Other approaches

Besides the general approaches:

  • O(n) linear search
  • O(log(n)) binary search (for sorted data)
  • O(1) lookups (for hashable lookups or other search problems that just need to search in some "bucket"s)

You can generally exploit any structure or knowledge about your data to reduce the amount of items you need to process. The problem is mostly that there are (probably) no already existing data structures for these and homemade implementations often end up orders of magnitude slower than naive "process all elements" approaches. But if you have any meta-informations about your sequences then you can exploit it.

Final remarks

The recipe for pairwise is actually quite nice, but you could also use iteration_utilities.successive1. Last I checked it was roughly 1.5 to 2 times faster than the recipe. Even if you don't change the approach and accept that you need to process all (or almost all) elements in the worst case it may be faster!

That data was probably generated. Maybe it's worthwhile to actually "search" for the elements during creation. That way you don't need to do the extra pass over the data at all. Or you could create the dict while creating the dataset (that allows for the O(1) lookups afterwards). Sometimes it's a good idea to look at the process that generated/downloaded/fetched the dataset if there is some way the information can be extracted.

Now, after writing all this text I need to state the obvious:

Your approach is really good. Even if it needs to process all elements in the worst case it uses a perfect fit (pairwise-recipe) for the problem at hand and it should actually work really fast even for long inputs. For a tuple containing 1 million 'z' it only needs 200ms on my computer. So you can process several million elements per second (even on old and slow computers like mine). That's probably not fast enough for big-data but then pure-python isn't a good language for processing big-data (typically you would need to write a C extension, use Cython or some NumPy, Pandas or derivative approach). Also the next function on a generator is lazy (assuming you use itertools.izip on python2 instead of zip) so you only process each tuple until you find a match.

Personally, I would simply use your original approach. Or if I had to find several pairs then I would just create the dictionary I mentioned earlier (maybe even serialize it) and do the lookups therein.


The bounty reason explicitly wants "credible and/or official sources". Fortunatly "search algorithms" are well studied so you can find explanations for each of the mentioned approaches in basic text books on algorithms. For example:

  • Cormen, et. al - Introduction to Algorithms
  • Sedgewick & Wayne - Algorithms

  • Wikipedia: "Linear search"

  • Wikipedia: "Binary search"
  • Wikipedia: "Hashtable" (that's essentially a dict).

There's also a small overview of time complexities of python types in the python wiki: "TimeComplexity". For lookups you have to check "Get Item " or "in".


1 Disclosure: I'm the author of that 3rd party library.

like image 86
MSeifert Avatar answered Oct 12 '22 16:10

MSeifert


Not much impressive though its working in your case check it out.

we are simply extracting index of matched item in sample and checking if it is consecutive or not.

def consecutive_index(src,sample):
    result = None
    il = [src.index(a) for a in sample if a in src]
    if len(il) == len(sample) and len(range(il[0],il[-1]))==1:
        result = il[0]
    return result



x0 = ('b', 'a', 'z', 'z')
x1 = ('b', 'a', 'z', 'z')
x2 = ('z', 'z', 'a', 'a')
x3 = ('z', 'b', 'a', 'a')
sample = ('b', 'a')

##TEST your given combinations.
print consecutive_index(x0,sample) #expected 0
print consecutive_index(x1,sample) #expected 0
print consecutive_index(x2,sample) #expected None
print consecutive_index(x3,sample) #expected 1
like image 42
DexJ Avatar answered Oct 12 '22 18:10

DexJ