Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PHP Extract Similar Parts from Multiple Strings

Tags:

string

php

I'm trying to extract the parts which are similar from multiple strings.

The purpose of this is an attempt to extract the title of a book from multiple OCRings of the title page.

This applies to only the beginning of the string, the ends of the strings don't need to be trimmed and can stay as they are.

For example, my strings might be:

$title[0]='the history of the internet, expanded and revised';
$title[1]='the history of the internet';
$title[2]='published by xyz publisher the historv of the internot, expanded and';
$title[3]='history of the internet';

So basically I would want to trim each string so that it starts at the most probable starting point. Considering that there may be OCR errors (e.g. "historv", "internot") I thought it might be best to take the number of characters from each word, which would give me an array for each string (so a multi-dimensional array) with a the length of each word. This can then be used to find running matches and trim the beginnings of the string to the most likely.

The strings should be cut to:

$title[0]='the history of the internet, expanded and revised';
$title[1]='the history of the internet';
$title[2]='the historv of the internot, expanded and';
$title[3]='XXX history of the internet';

So I need to be able to recognize that "history of the internet" (7 2 3 8) is the run which matches all strings, and that the preceding "the" is most probably correct seeing as it occurs in >50% of the strings, and therefore the beginning of each string is trimmed to "the" and a placeholder of the same length is added onto the string missing "the".

So far I have got:

function CompareSimilarStrings($array)
    {
    $n=count($array);

    // Get length of each word in each string >
    for($run=0; $run<$n; $run++)
        {
        $temp=explode(' ',$array[$run]);
        foreach($temp as $key => $val)
         $len[$run][$key]=strlen($val);
        }

    for($run=0; $run<$n; $run++)
        {

        }
    }

As you can see, I'm stuck on finding the running matches.

Any ideas?

like image 779
Alasdair Avatar asked Feb 24 '12 04:02

Alasdair


1 Answers

You should look into Smith-Waterman algorithm for local string alignment. It is a dynamic programming algorithm which finds parts of the string which are similar in that they have low edit distance.

So if you want to try it out, here is a php implementation of the algorithm.

like image 88
gintas Avatar answered Oct 19 '22 22:10

gintas