Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Comparing strings in PHP and removing one of them from the array if they are similar

Tags:

function

php

Let say I have an array like this:

  • Band of Horses - Is There a Ghost
  • Band Of Horses - No One's Gonna Love You
  • Band of Horses - The Funeral
  • Band of Horses - The Funeral (lyrics in description)
  • Band of Horses - Laredo
  • Band Of Horses - Laredo on Letterman 5.20.10
  • Band of Horses - "The Great Salt Lake" Sub Pop Records
  • Band Of Horses - "No One's Gonna Love You"
  • Band of Horses perform Marry Song at Tromso Wedding
  • Band Of Horses - No One's Gonna Love You
  • 'Laredo' by Band of Horses on Q TV
  • Band of Horses, On My Way Back Home
  • Band of Horses - cigarettes wedding bands
  • Band Of Horses - "Cigarettes Wedding Bands"
  • Band Of Horses - I Go To The Barn Because I Like The
  • Our Swords - Band of Horses
  • Band Of Horses - "Marry song"
  • Band of Horses - Monsters
  • Band Of Horses - No One's Gonna Love You

The New Array would have:

  • Band of Horses - Is There a Ghost
  • Band Of Horses - No One's Gonna Love You
  • Band of Horses - The Funeral
  • Band of Horses - Laredo
  • Band of Horses - "The Great Salt Lake" Sub Pop Records
  • Band of Horses, On My Way Back Home
  • Band of Horses - cigarettes wedding bands
  • Band Of Horses - I Go To The Barn Because I Like The
  • Our Swords - Band of Horses
  • Band Of Horses - "Marry song"
  • Band of Horses - Monsters

How would you go about comparing each string to every other string in the list in PHP and if they are similar, remove them.

I consider these similar:

  • Band of Horses - The Funeral
  • Band of Horses - The Funeral (lyrics in description)

Another example:

  • Band of Horses - Laredo
  • Band Of Horses - Laredo on Letterman 5.20.10
like image 614
Cody.Stewart Avatar asked Sep 10 '10 01:09

Cody.Stewart


People also ask

How do you compare two strings are similar?

The equals() method compares two strings, and returns true if the strings are equal, and false if not. Tip: Use the compareTo() method to compare two strings lexicographically.

Can you use == to compare strings in PHP?

The assignment operator assigns the variable on the left to have a new value as the variable on right, while the equal operator == tests for equality and returns true or false as per the comparison results. Example: This example describes the string comparison using the == operator.

How do I check if two strings are similar in PHP?

The strcmp() function compares two strings. Note: The strcmp() function is binary-safe and case-sensitive. Tip: This function is similar to the strncmp() function, with the difference that you can specify the number of characters from each string to be used in the comparison with strncmp().

How does PHP compare strings with comparison operators?

PHP will compare alpha strings using the greater than and less than comparison operators based upon alphabetical order. In the first example, ai comes before i in alphabetical order so the test of > (greater than) is false - earlier in the order is considered 'less than' rather than 'greater than'.


2 Answers

You have multiple options.

For each option you should probably massage the album names before performing the comparisons. You can do this by stripping punctuation, sorting the words in the album name alphabetically (in certain cases), etc.

In each case, when you do a comparison, if you remove one of the album names from the array, then your comparison is order sensitive, unless you make a rule as to which album name to remove. So, it probably makes sense to always remove the longer album name if two album names are compared and found to be "similar."

The main comparison options are

  1. Simple substring comparisons. Check if an album name is inside another. Strip punctuation first and compare case insensitively (see my second code snippet below).

  2. Check album name similarity using levenshtein(). This string comparison is more efficient then similar_text(). You should strip punctuation and order words alphabetically.

  3. Check album name similarity using similar_text(). I had the most luck with this method. In fact I got it to pick the exact album names you wanted (see first code snippet below).

  4. There are various other string comparison functions you can play around with including soundex() and metaphone()

Anyway... here are 2 solutions.

The first uses similar_text()... but it calculates the similarity only after all punctuation has been stripped and words put into alphabetical order and lowercased...... the downside is you have to play around with the threshold similarities... The second uses a simple case insensitive substring test after all punctuation and white space is stripped.

The way both code snippets work is that they use array_walk() to run the compare() function on each album in the array. Then inside the compare() function, I use foreach() to compare the current album to all the other albums. There's ample room to make things more efficient.

Note that I should be using the 3rd argument as a reference in array_walk can someone help me do this? The current work around is a global variable:


Live example (69% similarity threshold)


function compare($value, $key)
{
    global $array; // Should use 3rd argument of compare instead

    $value = strtolower(preg_replace("/[^a-zA-Z0-9 ]/", "", $value));
    $value = explode(" ", $value);
    sort($value);
    $value = implode($value);
    $value = preg_replace("/[\s]/", "", $value); // Remove any leftover \s

    foreach($array as $key2 => $value2)
    {
        if ($key != $key2)
        {
            // collapse, and lower case the string            
            $value2 = strtolower(preg_replace("/[^a-zA-Z0-9 ]/", "", $value2));
            $value2 = explode(" ", $value2);
            sort($value2);
            $value2 = implode($value2);            
            $value2 = preg_replace("/[\s]/", "", $value2);

              // Set up the similarity
            similar_text($value, $value2, $sim);
            if ($sim > 69)
            {     // Remove the longer album name
                unset($array[ ((strlen($value) > strlen($value2))?$key:$key2) ]);
            }
        }
    }
}
array_walk($array, 'compare');
$array = array_values($array);
print_r($array);

The output of the above is:

Array
(
    [0] => Band of Horses - Is There a Ghost
    [1] => Band Of Horses - No One's Gonna Love You
    [2] => Band of Horses - The Funeral
    [3] => Band of Horses - Laredo
    [4] => Band of Horses - "The Great Salt Lake" Sub Pop Records
    [5] => Band of Horses perform Marry Song at Tromso Wedding
    [6] => Band of Horses, On My Way Back Home
    [7] => Band of Horses - cigarettes wedding bands
    [8] => Band Of Horses - I Go To The Barn Because I Like The
    [9] => Our Swords - Band of Horses
    [10] => Band of Horses - Monsters
)

Note that the short version of Mary's song is missing... so it must have been a false positive against something else, since the long version is still in the list..... but they are precisely the album names you wanted.


The substring method:

Live Example


function compare($value, $key)
{
      // I should be using &$array as a 3rd variable.
      // For some reason couldn't get that to work, so I do this instead.
    global $array;   
      // Take the current album name and remove all punctuation and white space
    $value = preg_replace("/[^a-zA-Z0-9]/", "", $value);        
      // Compare current album to all othes
    foreach($array as $key2 => $value2)
    {
        if ($key != $key2)
        {

              // collapse the album being compared to
            $value2 = preg_replace("/[^a-zA-Z0-9]/", "", $value2);

            $subject = $value2;
            $pattern = '/' . $value . '/i';

              // If there's a much remove the album being compared to
            if (preg_match($pattern, $subject))
            {
                unset($array[$key2]);
            }
        }
    }
}
array_walk($array, 'compare');
$array = array_values($array);
echo "<pre>";
print_r($array);
echo "</pre>";

For your example string the above outputs (it shows 2 that you don't want shown):

Array  
(  
    [0] => Band of Horses - Is There a Ghost  
    [1] => Band Of Horses - No One's Gonna Love You  
    [2] => Band of Horses - The Funeral  
    [3] => Band of Horses - Laredo  
    [4] => Band of Horses - "The Great Salt Lake" Sub Pop Records  
    [5] => Band of Horses perform Marry Song at Tromso Wedding      // <== Oops
    [6] => 'Laredo' by Band of Horses on Q TV                       // <== Oops  
    [7] => Band of Horses, On My Way Back Home  
    [8] => Band of Horses - cigarettes wedding bands  
    [9] => Band Of Horses - I Go To The Barn Because I Like The  
    [10] => Our Swords - Band of Horses  
    [11] => Band Of Horses - "Marry song"  
    [12] => Band of Horses - Monsters  
)
like image 118
Peter Ajtai Avatar answered Oct 18 '22 09:10

Peter Ajtai


You might want to try similar_text, maybe in combination with levenshtein, and through experimentation establish a threshold of what score you consider to be similar enough. Also look at the user discussions for some more hints. You can then loop through the array, compare each element to each other element and remove elements that you deem too similar.

I hope this serves as a start for you. The problem is rather complex, since there are many things that may be considered to have the same content, but have completely different syntax ("Our Swords - Band of Horses" vs. "Band of Horses - Our Swords"). It depends if this rather simplistic solution is good enough for what you're trying to do.

like image 30
deceze Avatar answered Oct 18 '22 09:10

deceze