Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

finding needle in haystack, what is a better solution?

so given "needle" and "there is a needle in this but not thisneedle haystack"

I wrote

def find_needle(n,h):
    count = 0
    words = h.split(" ")
    for word in words:
        if word == n:
            count += 1
    return count

This is O(n) but wondering if there is a better approach? maybe not by using split at all?

How would you write tests for this case to check that it handles all edge cases?

like image 314
user299709 Avatar asked Apr 22 '15 23:04

user299709


People also ask

Is it possible to find a needle in a haystack?

The phrase “like finding a needle in a haystack” is a similie — sometimes, the thing we're in search of is going to be so hard to find, it's functionally impossible to do so.

How do you find a needle from a pile of dry grass?

Just put the haystack in a big water container, the needle sinks and the haystack floats.

What is haystack problem?

The Needle in a Haystack Problem is a nick name for something technically known as imbalanced or skewed datasets. A dataset is a representation of objects or events in the real life, and an imbalanced dataset is characterized when most of the examples contains a specific value in one or more columns.

What does it mean looking for a needle in a haystack?

Definition of a needle in a haystack informal. : someone or something that is very hard to find Searching for your earring at the park will be like looking for a needle in a haystack.


3 Answers

I don't think it's possible to get bellow O(n) with this (because you need to iterate trough the string at least once). You can do some optimizations.

I assume you want to match "whole words", for example looking up foo should match like this:

foo and foo, or foobar and not foo.
^^^     ^^^                    ^^^

So splinting just based on space wouldn't do the job, because:

>>> 'foo and foo, or foobar and not foo.'.split(' ')
['foo', 'and', 'foo,', 'or', 'foobar', 'and', 'not', 'foo.']
#                  ^                                     ^

This is where re module comes in handy, which will allows you to build fascinating conditions. For example \b inside the regexp means:

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of Unicode alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore Unicode character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.

So r'\bfoo\b' will match only whole word foo. Also don't forget to use re.escape():

>>> re.escape('foo.bar+')
'foo\\.bar\\+'
>>> r'\b{}\b'.format(re.escape('foo.bar+'))
'\\bfoo\\.bar\\+\\b'

All you have to do now is use re.finditer() to scan the string. Based on documentation:

Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match.

I assume that matches are generated on the fly, so they never have to be in memory at once (which may come in handy with large strings, with many matched items). And in the end just count them:

>>> r = re.compile(r'\bfoo\b')
>>> it = r.finditer('foo and foo, or foobar and not foo.')
>>> sum(1 for _ in it)
3
like image 91
Vyktor Avatar answered Sep 25 '22 16:09

Vyktor


This does not address the complexity issue but simplifies the code:

def find_needle(n,h):
    return h.split().count(n)
like image 41
Jérôme Avatar answered Sep 22 '22 16:09

Jérôme


You can use Counter

from collections import Counter

def find_needle(n,h):
    return Counter(h.split())[n]

i.e.:

n = "portugal"
h = 'lobito programmer from portugal hello fromportugal portugal'

print find_needle(n,h)

Output:

2

DEMO

like image 25
Pedro Lobito Avatar answered Sep 22 '22 16:09

Pedro Lobito