After inserting 100000000 elements into my heap and my unsorted list, it seems that the heap insertion is actually faster (12 seconds vs 20 seconds). Why is this? I believe heap insertion is O(log n) while unsorted list insertion is O(1). I also noticed that my heap insertion implementation doesn't actually scale with the number of inputs. This also confuses me.
Here is the code that I ran:
#include <cstdlib>
#include <ctime>
#include <fstream>
using namespace std;

// UnsortedPQ and HeapPQ are my own classes; their insertItem implementations are shown below.

int main ()
{
    clock_t unsortedStart;
    clock_t heapStart;
    double unsortedDuration;
    double heapDuration;
    int num_pushes = 100000000;
    int interval = 10000;
    ofstream unsorted ("unsorted.txt");
    ofstream heap ("heap.txt");
    UnsortedPQ<int> unsortedPQ;
    HeapPQ<int> heapPQ;

    unsortedStart = clock();
    for (int i = 0; i < num_pushes; ++i)
    {
        if (i % interval == 0) {
            unsortedDuration = ( clock() - unsortedStart ) / (double) CLOCKS_PER_SEC;
            unsorted << unsortedDuration << " " << i << endl;
        }
        unsortedPQ.insertItem(rand() % 100);
    }

    heapStart = clock();
    for (int i = 0; i < num_pushes; ++i)
    {
        if (i % interval == 0) {
            heapDuration = ( clock() - heapStart ) / (double) CLOCKS_PER_SEC;
            heap << heapDuration << " " << i << endl;
        }
        heapPQ.insertItem(rand() % 100);
    }
    return 0;
}
This is the heap implementation of insert (it uses a std::vector):
template <class T>
void HeapPQ<T>::insertItem(T data) {
    // insert into the back of the heap (std::vector)
    dataArray.push_back(data);
    int i = dataArray.size() - 1;
    // sift the inserted element up
    while (i != 0 && dataArray[(i - 1) / 2] > dataArray[i]) {
        swap(dataArray[i], dataArray[(i - 1) / 2]);
        i = (i - 1) / 2;
    }
}
This is the unsorted list implementation of insert (it uses a std::list):
//pushes element to the back of a std::list
template <class T>
void UnsortedPQ<T>::insertItem(T data) { dataList.push_back(data); }
The insertion into the heap is O(log n), which means every insertion could take at most O(log n) steps. It does not mean it has to.
In your example the average cost of inserting an element is O(1). Why is that?
For simplicity, let's assume you insert only 0s and 1s in random order (in your current version only the numbers 0..99 (rand() % 100) are inserted; the calculation is more complex, but the behavior stays the same). After 2*n elements are inserted, there would be about n 0s and n 1s in the heap, and the heap would look as follows:
              0
          0       0
        0   0   0   0
      .................
     0   0   0   ...   0
    1 1 1 1 1 1 ... 1 1 1
So basically, the 1s are all at the last level k, and the 0s are at the levels 0..k-1.
If a 1 is inserted, there is nothing to do (there are no 2s above it). If a 0 is inserted, there is at most one swap (1s may be in the level above the last level, but not 2 levels above). That means on average we need only 0.5 swaps, and not k.
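You can check this claim empirically by counting the sift-up swaps directly. Below is a minimal, self-contained sketch (the helper name insertAndCountSwaps is mine, not from the code above) that inserts random 0s and 1s into a min-heap stored in a std::vector, using the same sift-up loop as insertItem, and prints the average number of swaps per insertion; it should come out well below 1:

#include <cstdlib>
#include <iostream>
#include <utility>
#include <vector>

// Insert value into the min-heap stored in heap and return the number
// of swaps performed while sifting it up (same loop as insertItem above).
int insertAndCountSwaps(std::vector<int>& heap, int value)
{
    heap.push_back(value);
    int i = heap.size() - 1;
    int swaps = 0;
    while (i != 0 && heap[(i - 1) / 2] > heap[i]) {
        std::swap(heap[i], heap[(i - 1) / 2]);
        i = (i - 1) / 2;
        ++swaps;
    }
    return swaps;
}

int main()
{
    const int n = 1000000;
    std::vector<int> heap;
    long long totalSwaps = 0;
    for (int k = 0; k < n; ++k)
        totalSwaps += insertAndCountSwaps(heap, rand() % 2);  // only 0s and 1s
    std::cout << "average swaps per insertion: " << totalSwaps / (double) n << '\n';
    return 0;
}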
Since both have the same asymptotic running time, it all comes down to the (amortized) costs of inserting into a vector versus into a list. The list seems to be slower (my assumption would be that for every insert it needs to allocate a node on the heap via new, and this is quite a slow operation; the costs depend on other factors, e.g. the size of the inserted objects, so which one is faster may vary).
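To see how much of that gap comes from the containers alone, you can strip away the priority-queue logic and time raw push_back calls on both containers. The following is only a rough sketch along the lines of the question's clock()-based timing (the element count is reduced to keep memory use modest), not a rigorous benchmark:

#include <ctime>
#include <iostream>
#include <list>
#include <vector>

int main()
{
    const int n = 10000000;  // fewer elements than in the question, to keep memory use modest

    std::vector<int> vec;
    clock_t start = clock();
    for (int i = 0; i < n; ++i)
        vec.push_back(i % 100);  // amortized O(1); occasional reallocation plus copy
    double vecSeconds = (clock() - start) / (double) CLOCKS_PER_SEC;

    std::list<int> lst;
    start = clock();
    for (int i = 0; i < n; ++i)
        lst.push_back(i % 100);  // O(1), but one heap allocation per node
    double listSeconds = (clock() - start) / (double) CLOCKS_PER_SEC;

    std::cout << "vector: " << vecSeconds << " s, list: " << listSeconds << " s\n";
    return 0;
}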
Let's take a closer look at your case, where the numbers are drawn from a uniform distribution over [0..99]. After n >> 100 insertions we will have the following situation (there is some hand-waving involved, but the gist should be clear):
- The last level (the k-th) of the heap has n/2 elements and consists of the numbers 50..99. So for 50% of the possible numbers (i.e. 50..99) no shift is needed.
- The level before it (the (k-1)-th) has n/4 elements and consists of the numbers 25..49. That means for 25% of the possible numbers exactly 1 shift is needed.
- The level k-2 has n/8 elements and consists of the numbers 13..24.
- ...
- The levels above level log 100 / log 2 contain only 0s. So the maximal number of shifts possible is m = log 100 / log 2, independent of n, the number of elements in the heap.

So the worst-case cost of an insertion would be log 100 / log 2, and the average cost is even smaller:
E(insertion)=0*1/2+1*1/4+2*1/8+...<=1.0
i.e. on average we have less than 1 shift per insertion.
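The same swap-counting experiment as before, now with the question's actual distribution (rand() % 100), should confirm this: the measured average stays below 1 swap per insertion. A self-contained sketch:

#include <cstdlib>
#include <iostream>
#include <utility>
#include <vector>

int main()
{
    const int n = 1000000;
    std::vector<int> heap;   // min-heap, organized like the question's HeapPQ
    long long totalSwaps = 0;
    for (int k = 0; k < n; ++k) {
        heap.push_back(rand() % 100);
        int i = heap.size() - 1;
        while (i != 0 && heap[(i - 1) / 2] > heap[i]) {  // sift up, counting swaps
            std::swap(heap[i], heap[(i - 1) / 2]);
            i = (i - 1) / 2;
            ++totalSwaps;
        }
    }
    std::cout << "average swaps per insertion: " << totalSwaps / (double) n << '\n';
    return 0;
}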
NB: That does not mean that inserting into the heap has an amortized cost of O(1). If you inserted the numbers not in random order, but first all the 99s, then all the 98s, ..., then all the 0s, the cost would be O(log n) per insertion.
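For contrast, here is a hedged sketch of that adversarial order: inserting values in descending order makes every new element a new minimum, so it bubbles all the way up to the root, and the average number of swaps grows like log2(n) instead of staying constant:

#include <cmath>
#include <iostream>
#include <utility>
#include <vector>

int main()
{
    const int n = 1000000;
    std::vector<int> heap;   // min-heap
    long long totalSwaps = 0;
    for (int k = 0; k < n; ++k) {
        heap.push_back(n - k);  // descending values: each new element is a new minimum
        int i = heap.size() - 1;
        while (i != 0 && heap[(i - 1) / 2] > heap[i]) {
            std::swap(heap[i], heap[(i - 1) / 2]);
            i = (i - 1) / 2;
            ++totalSwaps;
        }
    }
    std::cout << "average swaps per insertion: " << totalSwaps / (double) n
              << " (log2(n) is about " << std::log2(n) << ")\n";
    return 0;
}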