Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Complex String Comparisons

Tags:

php

I'm trying to write a function in PHP that takes an array of strings (needle) and performs a comparison against another array of strings (haystack). The purpose of this function is to quickly deliver matching strings for an AJAX search, so it needs to be as fast as possible.

Here's some sample code to illustrate the two arrays;

$needle = array('ba','hot','resta');

$haystack = array(
    'Southern Hotel',
    'Grange Restaurant & Hotel',
    'Austral Hotel',
    'Barsmith Hotel',
    'Errestas'
);

Whilst this is quite easy in itself, the aim of the comparison is to count how many of the needle strings appear in the haystack.

However, there are three constraints;

  1. The comparison is case-insensitive
  2. The needle must only match characters at the beginning of the word. For example, "hote" will match "Hotel", but "resta" will not match "Errestas".
  3. We want to count the number of matching needles, not the number of needle appearances. If a place is named "Hotel Hotel Hotel", we need the result to be 1 not 3.

Using the above example, we'd expect the following associative array as a result:

$haystack = array(
    'Southern Hotel' => 1,
    'Grange Restaurant & Hotel' => 2,
    'Austral Hotel' => 1,
    'Barsmith Hotel' => 2,
    'Erresta'  => 0
);

I've been trying to implement a function to do this, using a preg_match_all() and a regexp which looks like /(\A|\s)(ba|hot|resta)/. Whilst this ensures we only match the beginning of words, it doesn't take into account strings which contain the same needle twice.

I am posting to see whether someone else has a solution?

like image 449
Lachlan McDonald Avatar asked Dec 30 '22 19:12

Lachlan McDonald


1 Answers

I've found you're description of the problem detailed enough that I could take a TDD approach for solving it. So, because I'm so much trying to be a TDD guy, I've wrote the tests and the function to make the tests pass. Namings might not be perfect, but they're easily changeable. The algorithm of the function may also not be the best, but now that there are tests, refactoring should be very easy and painless.

Here are the tests:

class MultiMatcherTest extends PHPUnit_Framework_TestCase
{
    public function testTheComparisonIsCaseInsensitive()
    {
        $needles  = array('hot');
        $haystack = array('Southern Hotel');
        $result   = match($needles, $haystack);

        $this->assertEquals(array('Southern Hotel' => 1), $result);
    }

    public function testNeedleMatchesOnlyCharsAtBeginningOfWord()
    {
        $needles  = array('resta');
        $haystack = array('Errestas');
        $result   = match($needles, $haystack);

        $this->assertEquals(array('Errestas' => 0), $result);
    }

    public function testMatcherCountsNeedlesNotOccurences()
    {
        $needles  = array('hot');
        $haystack = array('Southern Hotel', 'Grange Restaurant & Hotel');
        $expected = array('Southern Hotel'            => 1,
                          'Grange Restaurant & Hotel' => 1);
        $result   = match($needles, $haystack);

        $this->assertEquals($expected, $result);
    }

    public function testAcceptance()
    {
        $needles  = array('ba','hot','resta');
        $haystack = array(
            'Southern Hotel',
            'Grange Restaurant & Hotel',
            'Austral Hotel',
            'Barsmith Hotel',
            'Errestas',
        );
        $expected = array(
            'Southern Hotel'            => 1,
            'Grange Restaurant & Hotel' => 2,
            'Austral Hotel'             => 1,
            'Barsmith Hotel'            => 2,
            'Errestas'                  => 0,
        );

        $result = match($needles, $haystack);

        $this->assertEquals($expected, $result);
    }
}


And here's the function:

function match($needles, $haystack)
{
    // The default result will containg 0 (zero) occurences for all $haystacks
    $result = array_combine($haystack, array_fill(0, count($haystack), 0));

    foreach ($needles as $needle) {

        foreach ($haystack as $subject) {
            $words = str_word_count($subject, 1); // split into words

            foreach ($words as $word) {
                if (stripos($word, $needle) === 0) {
                    $result[$subject]++;

                    break;
                }
            }
        }
    }

    return $result;
}


Testing that the break statement is necessary

The following test shows when break is necessary. Run this test both with and without a break statement inside the match function.

/**
 * This test demonstrates the purpose of the BREAK statement in the
 * implementation function. Without it, the needle will be matched twice.
 * "hot" will be matched for each "Hotel" word.
 */
public function testMatcherCountsNeedlesNotOccurences2()
{
    $needles  = array('hot');
    $haystack = array('Southern Hotel Hotel');
    $expected = array('Southern Hotel Hotel' => 1);
    $result   = match($needles, $haystack);

    $this->assertEquals($expected, $result);
}
like image 72
Ionuț G. Stan Avatar answered Jan 01 '23 11:01

Ionuț G. Stan