PHP Stop Word List

I'm experimenting with stop words in my code. I have an array of words that I'd like to check, and an array of stop words to check them against.

At the moment I'm looping through the array one word at a time and removing the word if in_array finds it in the stop word list, but I wonder if there's a better way of doing it. I've looked at array_diff and similar functions; however, if the first array contains multiple stop words, array_diff appears to remove only the first occurrence.

The focus is on speed and memory usage but speed more so.

Edit -

The first array contains single words taken from blog comments (which are usually quite long); the second array contains single stop words. Sorry for not making that clear.

Thanks

Dom Hodgson asked May 02 '10 08:05


3 Answers

Using str_replace...

A simple approach is to use str_replace or str_ireplace, which can take an array of 'needles' (things to search for), corresponding replacements, and an array of 'haystacks' (things to operate on).

$haystacks=array(
  "The quick brown fox",
  "jumps over the ",
  "lazy dog"
);

$needles=array(
  "the", "lazy", "quick"
);

$result=str_ireplace($needles, "", $haystacks);

var_dump($result);

This produces

array(3) {
  [0]=>
  string(11) "  brown fox"
  [1]=>
  string(12) "jumps over  "
  [2]=>
  string(4) " dog"
}

As an aside, a quick way to clean up the leading and trailing spaces this leaves is to use array_map to call trim on each element (doubled spaces inside a string are left as they are):

$result=array_map("trim", $result);

The drawback of using str_replace is that it will replace matches found within words, rather than just whole words. To address that, we can use regular expressions...
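To make that drawback concrete, here is a minimal sketch. The sample string is made up for illustration; it only shows that a stop word is also cut out of a longer word that contains it.

```php
<?php
// Illustrative input only; not taken from the question.
$haystack = "fortify the general theme";

// str_ireplace matches substrings, so "the" is also removed from "theme".
$result = str_ireplace("the", "", $haystack);

var_dump($result); // string(19) "fortify  general me"
```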

Use preg_replace

An approach using preg_replace looks very similar to the above, but the needles are regular expressions, and we check for a 'word boundary' at the start and end of the match using \b:

$haystacks=array(
  "For we shall use fortran to",
  "fortify the general theme",
  "of this torrent of nonsense"
);

$needles=array(
  '/\bfor\b/i', 
  '/\bthe\b/i', 
  '/\bto\b/i', 
  '/\bof\b/i'
);

$result=preg_replace($needles, "", $haystacks);
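If the stop words are kept as a plain word list rather than as hand-written patterns, one possible variation is to build a single pattern from the list. The sketch below is only that: the helper name removeStopWords is hypothetical, and collapsing the leftover whitespace is an extra step that is not part of the answer above.

```php
<?php
// Hypothetical helper: builds one case-insensitive pattern from a word list.
function removeStopWords(array $haystacks, array $stopWords): array
{
    // Escape each word so regex metacharacters in it are treated literally.
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $stopWords);

    // One alternation wrapped in word boundaries, e.g. /\b(?:for|the|to|of)\b/i
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';

    $stripped = preg_replace($pattern, '', $haystacks);

    // Collapse the gaps left behind, then trim both ends of every string.
    $collapsed = preg_replace('/\s+/', ' ', $stripped);
    return array_map('trim', $collapsed);
}

$result = removeStopWords(
    array(
        "For we shall use fortran to",
        "fortify the general theme",
        "of this torrent of nonsense",
    ),
    array("for", "the", "to", "of")
);

var_dump($result);
```

Because of the \b anchors, "fortran", "fortify", "theme" and "torrent" are left intact while the standalone stop words are removed.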

Paul Dixon answered Oct 04 '22 00:10


array_diff() should work.

$sentence = "the quick brown fox jumps the fence and runs";
$array = explode(" ", $sentence);
$stopwords = array("the","and","an","of");

print_r(array_diff($array,$stopwords));

Result

Array
(
    [1] => quick
    [2] => brown
    [3] => fox
    [4] => jumps
    [6] => fence
    [8] => runs
)

I tested on this site: http://sandbox.onlinephpfunctions.com/
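As the keys in the result show, array_diff keeps the original keys and leaves gaps where words were removed. If a consecutively numbered list is wanted, one option is to pass the result through array_values. This is a small sketch using the same sample data; the variable names are illustrative.

```php
<?php
$sentence  = "the quick brown fox jumps the fence and runs";
$words     = explode(" ", $sentence);
$stopwords = array("the", "and", "an", "of");

// array_diff drops every value of $words that equals any stop word,
// so both occurrences of "the" are removed. The comparison is case-sensitive.
// array_values then renumbers the remaining keys from 0.
$filtered = array_values(array_diff($words, $stopwords));

print_r($filtered);
```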

anubina answered Oct 04 '22 01:10


If you already have two sorted arrays, you can use this algorithm to remove each element from array A that is also in array B (in mathematical terms: A \ B):

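// Both arrays must be sorted in strcmp() order and indexed 0..count-1.
for ($i=0, $n=count($a), $j=0, $m=count($b); $i<$n && $j<$m; ) {
    $diff = strcmp($a[$i], $b[$j]);
    if ($diff == 0) {
        // $a[$i] is in B: remove it; keep $j so duplicates in A are caught too
        unset($a[$i]);
        $i++;
    } elseif ($diff < 0) {
        // $a[$i] sorts before $b[$j], so it cannot be in B
        $i++;
    } else {
        // $b[$j] sorts before $a[$i]: move on in B
        $j++;
    }
}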

This requires only O(n + m) steps, where n and m are the lengths of A and B.
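For input that is not sorted yet, the loop can be wrapped as in the sketch below. The function name diffSorted is hypothetical, and the sorting step is an addition: it costs O(n log n) on its own and discards the original word order, so this only pays off when the arrays are already sorted or order does not matter.

```php
<?php
// Hypothetical wrapper; the loop follows the same logic as the one above.
function diffSorted(array $a, array $b): array
{
    // Sort both lists in strcmp() order; usort also reindexes them from 0.
    usort($a, 'strcmp');
    usort($b, 'strcmp');

    for ($i = 0, $n = count($a), $j = 0, $m = count($b); $i < $n && $j < $m; ) {
        $diff = strcmp($a[$i], $b[$j]);
        if ($diff == 0) {
            unset($a[$i]);   // word is in B: drop it, keep $j for duplicates
            $i++;
        } elseif ($diff < 0) {
            $i++;            // word sorts before $b[$j]: not in B
        } else {
            $j++;            // advance in B
        }
    }

    // Close the gaps left by unset().
    return array_values($a);
}

$words  = explode(" ", "the quick brown fox jumps the fence and runs");
$result = diffSorted($words, array("the", "and", "an", "of"));

print_r($result);
```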

Another approach would be to use the words of array B as keys for an index (using array_flip), iterate the values of A and see if they are a key in the index using array_key_exists:

$index = array_flip($b);
foreach ($a as $key => $val) {
    if (array_key_exists($val, $index)) {
        unset($a[$key]);
    }
}

Again, this is O(n) on average, because each key lookup in the index takes constant time. It avoids searching B for each value in A, which would be O(n²).
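A self-contained sketch of the index approach might look like the following. The helper name filterStopWords is hypothetical; using isset instead of array_key_exists and lowercasing each word before the lookup are assumptions added here (the latter makes the match case-insensitive), not part of the snippet above.

```php
<?php
// Hypothetical helper name, for illustration only.
function filterStopWords(array $words, array $stopWords): array
{
    // Stop words become array keys, so each membership test is a hash lookup.
    $index = array_flip($stopWords);

    $kept = array();
    foreach ($words as $word) {
        // Assumption: stop words are stored in lowercase, so lowercase the key.
        if (!isset($index[strtolower($word)])) {
            $kept[] = $word;
        }
    }
    return $kept;
}

$words = explode(" ", "The quick brown fox jumps the fence and runs");
$kept  = filterStopWords($words, array("the", "and", "an", "of"));

print_r($kept);
```

Unlike the sorted-array loop, this keeps the remaining words in their original order, at the cost of the extra memory used by the index.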

Gumbo answered Oct 04 '22 00:10