Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PHP Arrays - Remove duplicates ( Time complexity )

Okay this is not a question of "how to get all uniques" or "How to remove duplicates from my array in php". This is a question about the time complexity.

I figured that the array_unique is somewhat O(n^2 - n) and here's my implementation:

function array_unique2($array) 
{ 
    $to_return = array(); 
    $current_index = 0;

    for ( $i = 0 ; $i < count($array); $i++ ) 
    { 
        $current_is_unique = true; 

        for ( $a = $i+1; $a < count($array); $a++ ) 
        { 
            if ( $array[$i] == $array[$a] ) 
            { 
                $current_is_unique = false; 
                break; 
            } 
        } 
        if ( $current_is_unique ) 
        { 
            $to_return[$current_index] = $array[$i];
        } 

    } 

    return $to_return; 
}

However when benchmarking this against the array_unique i got the following result:

Testing (array_unique2)... Operation took 0.52146291732788 s.

Testing (array_unique)... Operation took 0.28323101997375 s.

Which makes the array_unique twice as fast, my question is, why ( Both had the same random data ) ?

And a friend of mine wrote the following:

function array_unique2($a)
{
    $n = array();
    foreach ($a as $k=>$v)
        if (!in_array($v,$n))
            $n[$k]=$v;
    return $n;
}

which is twice as fast as the built in one in php.

I'd like to know, why?

What is the time-complexity of array_unique and in_array?

Edit I removed the count($array) from both loops and just used a variable in the top of the function, that gained 2 seconds on 100 000 elements!

like image 831
Filip Ekberg Avatar asked Jan 25 '09 18:01

Filip Ekberg


People also ask

How do you remove duplicates from an array in PHP?

The array_unique() function removes duplicate values from an array. If two or more array values are the same, the first appearance will be kept and the other will be removed.

How are duplicates removed from an array without using any library PHP?

An array is defined and duplicate elements from the array can be found and removed using the 'array_flip' function, that basically reverses the keys/index as values and values as keys.

How do you remove duplicate values from an array?

We can remove duplicate element in an array by 2 ways: using temporary array or using separate index. To remove the duplicate element from array, the array must be in sorted order. If array is not sorted, you can sort it by calling Arrays.sort(arr) method.


2 Answers

While I can't speak for the native array_unique function, I can tell you that your friends algorithm is faster because:

  1. He uses a single foreach loop as opposed to your double for() loop.
  2. Foreach loops tend to perform faster than for loops in PHP.
  3. He used a single if(! ) comparison while you used two if() structures
  4. The only additional function call your friend made was in_array whereas you called count() twice.
  5. You made three variable declarations that your friend didn't have to ($a, $current_is_unique, $current_index)

While none of these factors alone is huge, I can see where the cumulative effect would make your algorithm take longer than your friends.

like image 63
Noah Goodrich Avatar answered Oct 12 '22 01:10

Noah Goodrich


The time complexity of in_array() is O(n). To see this, we'll take a look at the PHP source code.

The in_array() function is implemented in ext/standard/array.c. All it does is call php_search_array(), which contains the following loop:

while (zend_hash_get_current_data_ex(target_hash, (void **)&entry, &pos) == SUCCESS) {

    // checking the value...

    zend_hash_move_forward_ex(target_hash, &pos);
}

That's where the linear characteristic comes from.

This is the overall characteristic of the algorithm, becaus zend_hash_move_forward_ex() has constant behaviour: Looking at Zend/zend_hash.c, we see that it's basically just

*current = (*current)->pListNext;

As for the time complexity of array_unique():

  • first, a copy of the array will be created, which is an operation with linear characteristic
  • then, a C array of struct bucketindex will be created and pointers into our array's copy will be put into these buckets - linear characteristic again
  • then, the bucketindex-array will be sorted usign quicksort - n log n on average
  • and lastly, the sorted array will be walked and and duplicate entries will be removed from our array's copy - this should be linear again, assuming that deletion from our array is a constant time operation

Hope this helps ;)

like image 33
Christoph Avatar answered Oct 12 '22 01:10

Christoph