Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Data Structures: If Heaps are trees, why are they implemented internally with Lists or arrays?

I'm giving myself a refresher course in data structures and algorithms (and learning new stuff -- I was an Information Systems major in college instead of Computer Science, so I didn't get a formal education in these things), and I've been working on heaps. I'm a little confused.

My understanding is that a Heap is basically a semi-sorted tree, where the value of each child node is guaranteed to be less than the value of its parent (assume for this discussion MinHeaps). So if it's a tree, why is it that every implementation I've seen has used an array-like structure internally instead of building a set of tree nodes?

It seems weird to me that I have to memorize that the children of N in an array sit at 2N + 1 (left) and 2N + 2 (right)*. Why not just build a node with Left and Right properties and go from there?

*Source: This article

like image 778
Ari Roth Avatar asked Dec 14 '22 05:12

Ari Roth


2 Answers

TL;DR: Save on memory overhead, get more speed from data locality.

For a binary tree you need in each node 4 bytes for left child and 4 byte for the right child (or 8+8 if you are on 64 bit system). That's just the bare pointers you need. If you store a 32 bit int that's a lot of overhead. Add another pointer for the parent that is needed for pushing a node towards the root and you are looking at 24 bytes of overhead for a 4 byte integer on a 64bit system.

For a heap you don't need to worry about arbitrary trees. You usually only worry about the head (the min/max of the values) and you don't care about the internal structure. The heap is an almost complete binary tree (all levels are filled except the last one which is filled left to right). In this structure if you just put the nodes in an array, then for a node at index x you always find the parent at (x+1)/2 the left child at x*2+1 and the right child at x*2+2. So no need to store any of those fat pointers.

In addition to the space saved you also get a speed boost because the memory is contiguous so it's more likely that it will be cached together (not guaranteed, just more likely).

Sure, if it not something where efficiency is important, you can implement it as a regular tree. And conversely, if you have a almost complete tree and you want to the most out of you system implement it with an array (even if you don't use it as a heap).

like image 176
Sorin Avatar answered Feb 01 '23 23:02

Sorin


First, let's make a little clarification about vocabulary:

  • A (min-oriented, but I won't precise it every time) priority queue is an abstract data structure which implements the operations add and deleteMin and sometimes decreaseKey. Actually, you can make a simple array/list where you loop through the structure to find the minimum, and you would have implemented a priority queue (a very inefficient one, but still).
  • A heap is a tree-based data structure that satisfies the heap property (Wikipedia). The heap property is: a parent has a lower key than its childs.
  • The data structure you are describing is the very common binary heap which is a kind of heap, but not the only one. (Nor the most efficient but that's another story).

The first time I heard of the binary heap, I also thought it was very weird to have a tree in an array, and you'd have to make some strange multiplications to get to the childs/parent.

It's more difficult to represent it in your head, but it makes perfect sense, if you look a little bit closer:

  • A binary heap is nearly balanced, that is there will never be a hole in the array. (This simple property is awesome in itself, because holes in arrays are a real pain).
  • It takes less space: an array is MUCH more memory efficient than nodes.
  • As it is designed, it is quite easy to abstract the array as a binary tree. You can create helpers like getRight(int node), getLeft(int node) and getParent(int node), and the implementation will look more familiar.

However, a binary heap is not cache-efficient, because childs are very far away from the parent, though it may be more cache-efficient than a node-based equivalent binary heap.

Now if you look at the pros and cons, the only con is that the array-based binary heap requires one more step to picture it in one's head, but it's winning everything else.

I don't know if the original heap was designed as an array, but somehow one day someone found this implementation and the array has become the standard for binary heaps.

However, other kind of heaps are implemented with nodes, so it's a special case.

like image 22
T. Claverie Avatar answered Feb 02 '23 00:02

T. Claverie