How does similar_text work?

I just found the similar_text function and was playing around with it, but the percentage output always surprises me. See the examples below.

I tried to find information on the algorithm used, as mentioned in the PHP docs for similar_text():

<?php
$p = 0;
similar_text('aaaaaaaaaa', 'aaaaa', $p);
echo $p . "<hr>";
//66.666666666667
//Since 5 out of 10 chars match, I would expect a 50% match

similar_text('aaaaaaaaaaaaaaaaaaaa', 'aaaaa', $p);
echo $p . "<hr>";
//40
//5 out of 20 > not 25% ?

similar_text(str_repeat('a', 100), 'aaaaa', $p);
echo $p . "<hr>";
//9.5238095238095
//5 out of 100 > not 5% ?

//Example from PHP.net
//Why is turning the strings around changing the result?

similar_text('PHP IS GREAT', 'WITH MYSQL', $p);
echo $p . "<hr>";
//27.272727272727

similar_text('WITH MYSQL', 'PHP IS GREAT', $p);
echo $p . "<hr>";
//18.181818181818

?>

Can anybody explain how this actually works?

Update:

Thanks to the comments, I found that the percentage is actually calculated as the number of similar characters * 200 / (length1 + length2):

Z_DVAL_PP(percent) = sim * 200.0 / (t1_len + t2_len);

So that explains why the percentages are higher than expected. With a string where 5 out of 95 match, it comes out at 10, which I can use.

similar_text(str_repeat('a', 95), 'aaaaa', $p);
echo $p . "<hr>";
//10
//5 out of 95 = 5 * 200 / (5 + 95) = 10
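To make the arithmetic concrete, here is a minimal JavaScript sketch of that percentage formula. The helper name similarPercent is made up for illustration; only the expression itself comes from the C line quoted above. Read another way, every matched character is counted once in each string and the total is divided by the combined length, which is why 5 matches against a 10-character string give roughly 66.7% rather than 50%.

```javascript
// Percentage as computed in the quoted C line:
//   percent = sim * 200.0 / (t1_len + t2_len)
// `similarPercent` is a hypothetical helper name, not part of PHP.
function similarPercent(sim, len1, len2) {
  // The C source returns 0 when both strings are empty (avoids dividing by zero).
  if (len1 + len2 === 0) return 0;
  return (sim * 200) / (len1 + len2);
}

console.log(similarPercent(5, 10, 5));  // about 66.67
console.log(similarPercent(5, 20, 5));  // 40
console.log(similarPercent(5, 100, 5)); // about 9.52
console.log(similarPercent(5, 95, 5));  // 10
```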

But I still can't figure out why PHP returns a different result when the strings are swapped. The JS code provided by dfsq doesn't do this. Looking at the PHP source code, the only difference I can find is in the following line, but I'm not a C programmer. Some insight into what the difference is would be appreciated.

In JS:

for (l = 0;(p + l < firstLength) && (q + l < secondLength) && (first.charAt(p + l) === second.charAt(q + l)); l++);

In PHP: (php_similar_str function)

for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);

Source:

/* {{{ proto int similar_text(string str1, string str2 [, float percent])
   Calculates the similarity between two strings */
PHP_FUNCTION(similar_text)
{
  char *t1, *t2;
  zval **percent = NULL;
  int ac = ZEND_NUM_ARGS();
  int sim;
  int t1_len, t2_len;

  if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "ss|Z", &t1, &t1_len, &t2, &t2_len, &percent) == FAILURE) {
    return;
  }

  if (ac > 2) {
    convert_to_double_ex(percent);
  }

  if (t1_len + t2_len == 0) {
    if (ac > 2) {
      Z_DVAL_PP(percent) = 0;
    }

    RETURN_LONG(0);
  }

  sim = php_similar_char(t1, t1_len, t2, t2_len);

  if (ac > 2) {
    Z_DVAL_PP(percent) = sim * 200.0 / (t1_len + t2_len);
  }

  RETURN_LONG(sim);
}
/* }}} */


/* {{{ php_similar_str
 */
static void php_similar_str(const char *txt1, int len1, const char *txt2, int len2, int *pos1, int *pos2, int *max)
{
  char *p, *q;
  char *end1 = (char *) txt1 + len1;
  char *end2 = (char *) txt2 + len2;
  int l;

  *max = 0;
  for (p = (char *) txt1; p < end1; p++) {
    for (q = (char *) txt2; q < end2; q++) {
      for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);
      if (l > *max) {
        *max = l;
        *pos1 = p - txt1;
        *pos2 = q - txt2;
      }
    }
  }
}
/* }}} */


/* {{{ php_similar_char
 */
static int php_similar_char(const char *txt1, int len1, const char *txt2, int len2)
{
  int sum;
  int pos1, pos2, max;

  php_similar_str(txt1, len1, txt2, len2, &pos1, &pos2, &max);

  if ((sum = max)) {
    if (pos1 && pos2) {
      sum += php_similar_char(txt1, pos1,
                  txt2, pos2);
    }
    if ((pos1 + max < len1) && (pos2 + max < len2)) {
      sum += php_similar_char(txt1 + pos1 + max, len1 - pos1 - max,
                  txt2 + pos2 + max, len2 - pos2 - max);
    }
  }

  return sum;
}
/* }}} */

Source in Javascript: similar text port to javascript

asked Jan 03 '13 09:01

Hugo Delsing

2 Answers

This was actually a very interesting question. Thank you for giving me a puzzle that turned out to be very rewarding.

Let me start out by explaining how similar_text actually works.

Similar Text: The Algorithm

It's a recursive divide-and-conquer algorithm. It first finds the longest common substring of the two inputs, then splits the problem into the parts to the left and to the right of that substring and repeats the process on each part.
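As a rough illustration, the two C functions quoted in the question can be ported to JavaScript like this. The names longestCommon and similarText are my own, and the sketch assumes plain ASCII input (PHP compares bytes, JavaScript compares UTF-16 code units), so treat it as an approximation of the quoted source rather than a definitive re-implementation.

```javascript
// Mirrors php_similar_str: find the longest common substring.
// Ties keep the first match found (strict `>`), scanning `a` in the outer loop.
function longestCommon(a, b) {
  let max = 0, pos1 = 0, pos2 = 0;
  for (let p = 0; p < a.length; p++) {
    for (let q = 0; q < b.length; q++) {
      let l = 0;
      while (p + l < a.length && q + l < b.length && a[p + l] === b[q + l]) l++;
      if (l > max) { max = l; pos1 = p; pos2 = q; }
    }
  }
  return { max, pos1, pos2 };
}

// Mirrors php_similar_char: count the match, then recurse left and right of it.
function similarText(a, b) {
  const { max, pos1, pos2 } = longestCommon(a, b);
  if (max === 0) return 0;
  let sum = max;
  // Left parts are compared only if both are non-empty.
  if (pos1 > 0 && pos2 > 0) {
    sum += similarText(a.slice(0, pos1), b.slice(0, pos2));
  }
  // Right parts are compared only if both are non-empty.
  if (pos1 + max < a.length && pos2 + max < b.length) {
    sum += similarText(a.slice(pos1 + max), b.slice(pos2 + max));
  }
  return sum;
}

console.log(similarText('aaaaaaaaaa', 'aaaaa')); // 5
console.log(similarText('test', 'wert'));        // 1
console.log(similarText('wert', 'test'));        // 2
```

Because the outer loop walks the first string and ties go to the first match found, the split point (and therefore the result) can depend on which string is passed first.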

The examples in your question all perform only one iteration of the algorithm. The only ones that need more than one iteration, and the only ones that give different results, are those from the php.net comments.

Here is a simple example that shows the main issue behind similar_text and hopefully gives some insight into how it works.

Similar Text: The Flaw

eeeefaaaaafddddd
ddddgaaaaagbeeee

Iteration 1:
Max    = 5
String = aaaaa
Left : eeeef and ddddg
Right: fddddd and gbeeee

I hope the flaw is already apparent. The algorithm only checks directly to the left and to the right of the longest matched string in both input strings. Consider this example:

$s1='eeeefaaaaafddddd';
$s2='ddddgaaaaagbeeee';

echo similar_text($s1, $s2).'|'.similar_text($s2, $s1);
// outputs 5|5, this is due to Iteration 2 of the algorithm
// it will fail to find a matching string in both left and right subsets

To be honest, I'm uncertain how this case should be treated. It can be seen that only 2 characters differ between the strings, but eeee and dddd sit on opposite ends of the two strings. I'm not sure what NLP enthusiasts or other literary experts would say about this specific situation.

Similar Text: Inconsistent results on argument swapping

The different results you saw depending on input order are due to the way the algorithm actually behaves (as described above). Here is a final explanation of what's going on.

echo similar_text('test','wert'); // 1
echo similar_text('wert','test'); // 2

In the first case, there's only one iteration:

test
wert

Iteration 1:
Max    = 1
String = t
Left : (empty) and wer
Right: est and (empty)

We only have one iteration because empty/null strings return 0 on recursion. So this ends the algorithm and we have our result: 1

In the second case, however, we are faced with multiple iterations:

wert
test

Iteration 1:
Max    = 1
String = e
Left : w and t
Right: rt and st

We already have a common string of length 1. The algorithm on the left subset will end in 0 matches, but on the right:

rt
st

Iteration 1:
Max    = 1
String = t
Left : r and s
Right: (empty) and (empty)

This will lead to our new and final result: 2
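The walkthrough above can be reproduced with a small tracing sketch in JavaScript. This is only an illustration under assumptions: the function name trace and the shape of the recorded steps are invented here, and the recursion follows the C source quoted in the question. Each recorded step holds the matched block plus the left and right remainders, and the result of similar_text is the sum of the matched block lengths.

```javascript
// Records every iteration that finds a match: the matched block and what is
// left over on each side. `trace` is a hypothetical name for illustration.
function trace(a, b, steps = []) {
  let max = 0, pos1 = 0, pos2 = 0;
  for (let p = 0; p < a.length; p++) {
    for (let q = 0; q < b.length; q++) {
      let l = 0;
      while (p + l < a.length && q + l < b.length && a[p + l] === b[q + l]) l++;
      if (l > max) { max = l; pos1 = p; pos2 = q; }
    }
  }
  if (max === 0) return steps;
  const left = [a.slice(0, pos1), b.slice(0, pos2)];
  const right = [a.slice(pos1 + max), b.slice(pos2 + max)];
  steps.push({ match: a.slice(pos1, pos1 + max), left, right });
  // Recurse only when both sides are non-empty, as the C code does.
  if (left[0] && left[1]) trace(left[0], left[1], steps);
  if (right[0] && right[1]) trace(right[0], right[1], steps);
  return steps;
}

for (const [a, b] of [['test', 'wert'], ['wert', 'test']]) {
  const matches = trace(a, b).map(s => s.match);
  const total = matches.reduce((n, m) => n + m.length, 0);
  console.log(`${a} / ${b}: ${JSON.stringify(matches)} total ${total}`);
}
// test / wert: ["t"] total 1
// wert / test: ["e","t"] total 2
```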

I thank you for this very informative question and the opportunity to dabble in C++ again.

Similar Text: JavaScript Edition

The short answer is: the JavaScript code is not implementing the correct algorithm.

sum += this.similar_text(first.substr(0, pos2), second.substr(0, pos2));

Obviously it should be first.substr(0, pos1).
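To see what that one-argument slip does, here is a hedged sketch that runs the same recursion with either left-hand slice. It is not the original port's code: the function name similarText and the buggyLeft flag are assumptions made for this illustration, and the rest follows the C source quoted in the question.

```javascript
// `buggyLeft = true` cuts the first string at pos2 (the mistake shown above);
// `buggyLeft = false` cuts it at pos1, as the C source does.
function similarText(a, b, buggyLeft = false) {
  let max = 0, pos1 = 0, pos2 = 0;
  for (let p = 0; p < a.length; p++) {
    for (let q = 0; q < b.length; q++) {
      let l = 0;
      while (p + l < a.length && q + l < b.length && a[p + l] === b[q + l]) l++;
      if (l > max) { max = l; pos1 = p; pos2 = q; }
    }
  }
  if (max === 0) return 0;
  let sum = max;
  if (pos1 > 0 && pos2 > 0) {
    const cut = buggyLeft ? pos2 : pos1;
    sum += similarText(a.slice(0, cut), b.slice(0, pos2), buggyLeft);
  }
  if (pos1 + max < a.length && pos2 + max < b.length) {
    sum += similarText(a.slice(pos1 + max), b.slice(pos2 + max), buggyLeft);
  }
  return sum;
}

const x = 'PHP IS GREAT', y = 'WITH MYSQL';
console.log(similarText(x, y), similarText(y, x));             // 3 2 (matches PHP)
console.log(similarText(x, y, true), similarText(y, x, true)); // 3 3 (buggy slice)
```

With the wrong slice, the left-hand recursion for 'WITH MYSQL' sees 'WITH' instead of 'W' and picks up an extra 'H', which is why the unfixed port appeared to be independent of argument order for this pair.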

Note: The JavaScript code has been fixed by eis in a previous commit. Thanks @eis

Demystified!

answered Oct 09 '22 03:10

Khez

It would indeed seem that the function uses different logic depending on the parameter order. I think there are two things at play.

First, see this example:

echo similar_text('test','wert'); // 1
echo similar_text('wert','test'); // 2

It seems that it is testing "how many times any distinct char in param1 is found in param2", and thus the result would be different if you swap the params around. It has been reported as a bug, which has been closed as "working as expected".

Now, the above is the same for both the PHP and JavaScript implementations: parameter order has an impact, so saying that the JS code wouldn't do this is wrong. This is argued in the bug entry as intended behaviour.

Second, what doesn't seem correct is the MYSQL/PHP word example. With that, the JavaScript version gives 3 irrespective of the order of params, whereas PHP gives 2 and 3 (and because of that, the percentage differs as well). Now, the phrases "PHP IS GREAT" and "WITH MYSQL" should have 5 characters in common, irrespective of which way you compare: H, I, S and T, one each, plus one for the space. In order they have 3 characters, 'H', ' ' and 'S', so if you look at the ordering, the correct answer should be 3 both ways. I modified the C code to a runnable version and added some output, so one can see what is happening there (codepad link):

#include <stdio.h>

/* {{{ php_similar_str
 */
static void php_similar_str(const char *txt1, int len1, const char *txt2, int len2, int *pos1, int *pos2, int *max)
{
  char *p, *q;
  char *end1 = (char *) txt1 + len1;
  char *end2 = (char *) txt2 + len2;
  int l;

  *max = 0;
  for (p = (char *) txt1; p < end1; p++) {
    for (q = (char *) txt2; q < end2; q++) {
      for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);
      if (l > *max) {
        *max = l;
        *pos1 = p - txt1;
        *pos2 = q - txt2;
      }
    }
  }
}
/* }}} */


/* {{{ php_similar_char
 */
static int php_similar_char(const char *txt1, int len1, const char *txt2, int len2)
{
  int sum;
  int pos1, pos2, max;

  php_similar_str(txt1, len1, txt2, len2, &pos1, &pos2, &max);

  if ((sum = max)) {
    if (pos1 && pos2) {
      printf("txt here %s,%s\n", txt1, txt2);
      sum += php_similar_char(txt1, pos1,
                  txt2, pos2);
    }
    if ((pos1 + max < len1) && (pos2 + max < len2)) {
      printf("txt here %s,%s\n", txt1+ pos1 + max, txt2+ pos2 + max);
      sum += php_similar_char(txt1 + pos1 + max, len1 - pos1 - max,
                  txt2 + pos2 + max, len2 - pos2 - max);
    }
  }

  return sum;
}
/* }}} */

int main(void)
{
    printf("Found %d similar chars\n",
        php_similar_char("PHP IS GREAT", 12, "WITH MYSQL", 10));
    printf("Found %d similar chars\n",
        php_similar_char("WITH MYSQL", 10,"PHP IS GREAT", 12));
    return 0;
}

The resulting output:

txt here PHP IS GREAT,WITH MYSQL
txt here P IS GREAT, MYSQL
txt here IS GREAT,MYSQL
txt here IS GREAT,MYSQL
txt here  GREAT,QL
Found 3 similar chars
txt here WITH MYSQL,PHP IS GREAT
txt here TH MYSQL,S GREAT
Found 2 similar chars

So one can see that on the first comparison, the function found 'H', ' ' and 'S', but not 'T', and got the result of 3. The second comparison found 'I' and 'T' but not 'H', ' ' or 'S', and thus got the result of 2.

The reason for these results can be seen from the output: the algorithm takes the first letter in the first string that the second string contains, counts that, and throws away the chars before that from the second string. That is why it misses the characters in between, and that is what causes the difference when you change the order.

What happens there might be intentional or it might not. However, that's not how the JavaScript version works. If you print out the same things in the JavaScript version, you get this:

txt here: PHP, WIT
txt here: P IS GREAT,  MYSQL
txt here: IS GREAT, MYSQL
txt here: IS, MY
txt here:  GREAT, QL
Found 3 similar chars
txt here: WITH, PHP 
txt here: W, P
txt here: TH MYSQL, S GREAT
Found 3 similar chars

This shows that the JavaScript version does it in a different way. The JavaScript version finds 'H', ' ' and 'S' in the same order in the first comparison, and the same 'H', ' ' and 'S' in the second one, so in this case the order of params doesn't matter.
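As a cross-check of the counts quoted earlier in this answer (5 characters in common, 3 of them in order), here is a small JavaScript sketch. To be clear, neither function is what similar_text does: commonChars is a plain multiset intersection and lcsLength is the textbook longest-common-subsequence dynamic program, both brought in here only to verify the arithmetic, and both names are my own. Unlike similar_text, both measures are symmetric in their arguments.

```javascript
// Order-insensitive count: size of the multiset intersection of characters.
function commonChars(a, b) {
  const counts = new Map();
  for (const ch of a) counts.set(ch, (counts.get(ch) || 0) + 1);
  let n = 0;
  for (const ch of b) {
    const c = counts.get(ch) || 0;
    if (c > 0) { counts.set(ch, c - 1); n++; }
  }
  return n;
}

// Order-sensitive count: length of the longest common subsequence,
// computed row by row with the standard dynamic-programming recurrence.
function lcsLength(a, b) {
  let prev = new Array(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    const cur = [0];
    for (let j = 1; j <= b.length; j++) {
      cur[j] = a[i - 1] === b[j - 1]
        ? prev[j - 1] + 1
        : Math.max(prev[j], cur[j - 1]);
    }
    prev = cur;
  }
  return prev[b.length];
}

const x = 'PHP IS GREAT', y = 'WITH MYSQL';
console.log(commonChars(x, y), commonChars(y, x)); // 5 5
console.log(lcsLength(x, y), lcsLength(y, x));     // 3 3
```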

As the JavaScript is meant to duplicate the code of the PHP function, it needs to behave identically, so I submitted a bug report based on the analysis by @Khez, together with the fix, which has now been merged.

answered Oct 09 '22 03:10

eis
