Finding a needle in a haystack: what is a better solution?

Tags: python, dynamic-programming

Given the needle "needle" and the haystack "there is a needle in this but not thisneedle haystack", I wrote:

def find_needle(n,h):
    count = 0
    words = h.split(" ")
    for word in words:
        if word == n:
            count += 1
    return count

This is O(n), but I am wondering whether there is a better approach, perhaps one that does not use split at all.

How would you write tests for this case to check that it handles all edge cases?

asked Apr 22 '15 23:04

user299709

3 Answers

I don't think it's possible to get below O(n) here, because you need to iterate through the string at least once. You can still make some improvements.

I assume you want to match "whole words". For example, looking up foo should match like this:

foo and foo, or foobar and not foo.
^^^     ^^^                    ^^^

So splitting on spaces alone won't do the job, because punctuation stays attached to the words:

>>> 'foo and foo, or foobar and not foo.'.split(' ')
['foo', 'and', 'foo,', 'or', 'foobar', 'and', 'not', 'foo.']
#                  ^                                     ^

This is where the re module comes in handy, since it lets you express much more precise conditions. For example, \b inside a regular expression means:

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of Unicode alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore Unicode character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.

So r'\bfoo\b' will match only the whole word foo. Also, don't forget to pass the needle through re.escape(), so that any regex metacharacters it contains are matched literally:

>>> re.escape('foo.bar+')
'foo\\.bar\\+'
>>> r'\b{}\b'.format(re.escape('foo.bar+'))
'\\bfoo\\.bar\\+\\b'
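One caveat with the escaped example above: \b only matches next to a word character, so a needle that itself begins or ends with punctuation (such as the trailing + in foo.bar+) will not be found when it stands alone. A possible workaround, sketched here with a hypothetical helper named whole_word_pattern, is to replace \b with lookarounds that only require that no word character touches the match:

```python
import re

def whole_word_pattern(needle):
    """Compile a pattern that matches `needle` only when it is not glued to word characters.

    `whole_word_pattern` is a hypothetical helper name, not part of the re module.
    """
    # (?<!\w): the character before the match, if any, is not a word character
    # (?!\w):  the character after the match, if any, is not a word character
    return re.compile(r'(?<!\w){}(?!\w)'.format(re.escape(needle)))

text = 'foo.bar+ and foo.bar+, but not foo.bar+y'

# With \b, the boundary after '+' exists only when a word character follows,
# so the two standalone occurrences are missed and the glued one is matched.
boundary = re.compile(r'\b{}\b'.format(re.escape('foo.bar+')))
boundary_starts = [m.start() for m in boundary.finditer(text)]

# With lookarounds, the standalone occurrences match and the glued one does not.
lookaround_starts = [m.start() for m in whole_word_pattern('foo.bar+').finditer(text)]

print(boundary_starts)    # [31]
print(lookaround_starts)  # [0, 13]
```

For needles made only of letters, digits and underscores, both patterns should behave the same, so the lookaround form is only worth the extra noise when the needle may contain punctuation.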

All you have to do now is use re.finditer() to scan the string. According to the documentation:

Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match.

Because finditer() returns an iterator, I assume the matches are generated on the fly, so they never all have to be in memory at once (which may come in handy with large strings that contain many matches). In the end, just count them:

>>> r = re.compile(r'\bfoo\b')
>>> it = r.finditer('foo and foo, or foobar and not foo.')
>>> sum(1 for _ in it)
3
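Putting the pieces together, and touching on the question about tests, here is a minimal sketch. The parameter names, the empty-needle guard and the table of cases are assumptions added for illustration; the expected counts encode the "whole word, case-sensitive" reading of the problem, so adjust them if your definition differs.

```python
import re

def find_needle(needle, haystack):
    """Count whole-word, case-sensitive occurrences of `needle` in `haystack`."""
    if not needle:
        # Assumption: an empty needle means "nothing to look for".
        # Without this guard, r'\b\b' would match at every word boundary.
        return 0
    pattern = re.compile(r'\b{}\b'.format(re.escape(needle)))
    # finditer() yields matches one at a time; we only count them.
    return sum(1 for _ in pattern.finditer(haystack))

# (needle, haystack, expected count) triples covering typical edge cases.
CASES = [
    ('needle', 'there is a needle in this but not thisneedle haystack', 1),
    ('foo', 'foo and foo, or foobar and not foo.', 3),  # punctuation next to the word
    ('foo', '', 0),                                     # empty haystack
    ('', 'foo', 0),                                     # empty needle
    ('foo', 'Foo FOO foo', 1),                          # matching is case-sensitive
    ('foo', 'foo\tfoo\nfoo', 3),                        # tabs and newlines as separators
    ('foo', 'foo_bar foo3 (foo)', 1),                   # '_' and digits are word characters
    ('a.b', 'a.b axb', 1),                              # '.' is escaped, so 'axb' does not match
]

def run_cases():
    """Check every case and report the first mismatch via AssertionError."""
    for needle, haystack, expected in CASES:
        actual = find_needle(needle, haystack)
        assert actual == expected, (needle, haystack, expected, actual)

run_cases()
```

The same table could be fed to unittest or pytest; the point is that each edge case (empty input, punctuation, case, substrings, metacharacters) gets its own row.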

answered Sep 25 '22 16:09

Vyktor

This does not address the complexity issue but simplifies the code:

def find_needle(n,h):
    return h.split().count(n)
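Note that split() without an argument is not quite the same as the split(" ") used in the question: it splits on any run of whitespace and drops empty strings. A small sketch of the difference, using a made-up haystack:

```python
# Made-up haystack with a double space, a tab, a newline and a trailing space.
haystack = "needle  needle\tneedle\nneedle "

# split(" ") cuts at every single space only, so tabs and newlines stay inside
# the pieces and consecutive or trailing spaces produce empty strings.
by_space = haystack.split(" ")

# split() cuts at any run of whitespace and never produces empty strings.
by_whitespace = haystack.split()

print(by_space)                       # ['needle', '', 'needle\tneedle\nneedle', '']
print(by_space.count("needle"))       # 1
print(by_whitespace)                  # ['needle', 'needle', 'needle', 'needle']
print(by_whitespace.count("needle"))  # 4
```

Neither variant strips punctuation, so "needle," or "needle." would still not be counted.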

answered Sep 22 '22 16:09

Jérôme

You can use collections.Counter:

from collections import Counter

def find_needle(n,h):
    return Counter(h.split())[n]

For example:

n = "portugal"
h = 'lobito programmer from portugal hello fromportugal portugal'

print(find_needle(n, h))

Output:

2
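For a single lookup, building a Counter does more work than list.count(). It may pay off when several needles are looked up in the same haystack, because the haystack is then split and tallied only once. A sketch of that usage, where count_words is a hypothetical helper name:

```python
from collections import Counter

def count_words(haystack):
    """Tally every whitespace-separated word once (hypothetical helper)."""
    return Counter(haystack.split())

h = 'lobito programmer from portugal hello fromportugal portugal'
counts = count_words(h)

# Each lookup is now a dictionary access; missing words count as 0
# instead of raising KeyError.
print(counts["portugal"])  # 2
print(counts["from"])      # 1
print(counts["needle"])    # 0
```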

answered Sep 22 '22 16:09

Pedro Lobito
