so given "needle" and "there is a needle in this but not thisneedle haystack"
I wrote
def find_needle(n,h):
count = 0
words = h.split(" ")
for word in words:
if word == n:
count += 1
return count
This is O(n) but wondering if there is a better approach? maybe not by using split at all?
How would you write tests for this case to check that it handles all edge cases?
The phrase “like finding a needle in a haystack” is a similie — sometimes, the thing we're in search of is going to be so hard to find, it's functionally impossible to do so.
Just put the haystack in a big water container, the needle sinks and the haystack floats.
The Needle in a Haystack Problem is a nick name for something technically known as imbalanced or skewed datasets. A dataset is a representation of objects or events in the real life, and an imbalanced dataset is characterized when most of the examples contains a specific value in one or more columns.
Definition of a needle in a haystack informal. : someone or something that is very hard to find Searching for your earring at the park will be like looking for a needle in a haystack.
I don't think it's possible to get bellow O(n)
with this (because you need to iterate trough the string at least once). You can do some optimizations.
I assume you want to match "whole words", for example looking up foo
should match like this:
foo and foo, or foobar and not foo.
^^^ ^^^ ^^^
So splinting just based on space wouldn't do the job, because:
>>> 'foo and foo, or foobar and not foo.'.split(' ')
['foo', 'and', 'foo,', 'or', 'foobar', 'and', 'not', 'foo.']
# ^ ^
This is where re
module comes in handy, which will allows you to build fascinating conditions. For example \b
inside the regexp means:
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of Unicode alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore Unicode character. Note that formally,
\b
is defined as the boundary between a\w
and a\W
character (or vice versa), or between\w
and the beginning/end of the string. This means thatr'\bfoo\b'
matches'foo'
,'foo.'
,'(foo)'
,'bar foo baz'
but not'foobar'
or'foo3'
.
So r'\bfoo\b'
will match only whole word foo
. Also don't forget to use re.escape()
:
>>> re.escape('foo.bar+')
'foo\\.bar\\+'
>>> r'\b{}\b'.format(re.escape('foo.bar+'))
'\\bfoo\\.bar\\+\\b'
All you have to do now is use re.finditer()
to scan the string. Based on documentation:
Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match.
I assume that matches are generated on the fly, so they never have to be in memory at once (which may come in handy with large strings, with many matched items). And in the end just count them:
>>> r = re.compile(r'\bfoo\b')
>>> it = r.finditer('foo and foo, or foobar and not foo.')
>>> sum(1 for _ in it)
3
This does not address the complexity issue but simplifies the code:
def find_needle(n,h):
return h.split().count(n)
You can use Counter
from collections import Counter
def find_needle(n,h):
return Counter(h.split())[n]
i.e.:
n = "portugal"
h = 'lobito programmer from portugal hello fromportugal portugal'
print find_needle(n,h)
Output:
2
DEMO
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With