Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does array_udiff use a compare function instead of a predicate function?

Tags:

arrays

php

array_udiff computes the difference between two arrays using a callback function. However, it requires a compare function instead of a predicate function.

A compare function compares item A relative to item B. A predicate function would just determine whether or not item A is equal to item B.

Compare functions are usually required by sort functions to determine the correct ordering. Since array_udiff is just computing the differences, a predicate function that determines whether each pair is equal seems like it should suffice.

Why does array_udiff use a compare function instead of a predicate function? Does it matter if I use a predicate instead? i.e. Can I elect to just use the 0 and 1 return values to denote inequality and equality, discarding the -1 possibility? What adverse effect, if any, would this have on my results?

like image 518
Quolonel Questions Avatar asked Jan 12 '16 16:01

Quolonel Questions


Video Answer


1 Answers

The implementation for php_array_diff() (which supplies implementation for a number of userspace array functions) works by reusing a number of internal comparison functions.

This is because those comparison functions already exist for other purposes, and meet the needed task at hand: determine if two items are equal or not. That they do a little extra work is unimportant; what's important is the relative reduction in code that needs to be considered. (An equals function could easily be written in terms of a comparison function, or as a separate entity, but now you've got two functions to do the same job.)

The actual implementation also works by sorting. So you need to use a comparison algorithm suitable for sorting, or you will get unexpected results. For example:

$a = [0, 1, 2, 3, 4, 5, 6];
$b = [4];

print_r(array_udiff($a, $b, function($x, $y) {
    return $x <=> $y; //Sorting comparison function, correct
}));

print_r(array_udiff($a, $b, function($x, $y) {
    return $x != $y; // Equality test, incorrect
}));

gives

Array //Sorting comparison function, correct
(
    [0] => 0
    [1] => 1
    [2] => 2
    [3] => 3
    [5] => 5
    [6] => 6
)
Array // Equality test, incorrect
(
    [0] => 0
    [1] => 1
    [2] => 2
    [3] => 3
    [4] => 4 // equality test causes udiff to incorrectly include 4
    [5] => 5
    [6] => 6
)

The reason for this is the algorithm php_array_diff() uses. Basically it goes like this:

  • Duplicate and sort all of the input arrays
  • Set the output OUT equal to the sorted first input array
  • For each element in SRC
    • V is the value of the current element in SRC
    • For each input array A starting from the second
      • Skip ahead to the next element in A that is > V, but make a note if we go past one that is == V.
      • If we found a match for V, remove it from OUT.
      • If we don't (so it stays in the input array), skip ahead in SRC until we have a new V >= the current one

So, the algorithm relies on all of the inputs being sorted, and uses that fact, (and the the comparison function) so it only has to inspect each element in each input array once. If the comparison function doesn't result in an actually sorted array, the algorithm fails and you get a bad result.


HHVM may result in a different result, as HHVM uses a different sorting algorithm. HHVM uses a pure quicksort, while PHP uses a quicksort implementation derived from llvm that includes an insertion sort optimization.

Normally, different sorting algorithms arrive at the same solution via different means. That is, different algorithms cause elements to be compared in a different order, at different times, and in different quantities. In the case of an incorrect comparison function, this can have a large effect on the final order of the array.

like image 120
jbafford Avatar answered Nov 10 '22 15:11

jbafford