Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

First steps in Swift, performance issue when allocating small objects in a BST

In trying to learn Swift 2.2, I am facing a serious performance drop when trying to allocate many small objects (fundamentally, a BST of 262144 elements). My current benchmark is a Java 1.8.0_74 compile of an old snip that I wrote some years ago, which, on my 2012 Retina Macbook Pro executes in 59 seconds (59036178 microsecond). The Issue I can observe via Instruments is that I get literally dozens of swift_retain_ and swift_release per iteration. Not sure how to avoid them:

import Foundation
import Darwin;

import Foundation

public class BinarySearchTree<T : Comparable> {
    private var _value : T?;

    private var _leftTree : BinarySearchTree<T>?;
    private var _rightTree : BinarySearchTree<T>?;

    public init(value : T) {
        _value = value;
    }

    var value : T? {
        get {
            return self._value;
        }
        set {
            self._value = newValue;
        }
    }

    var leftTree : BinarySearchTree<T>? {
        get {
            return self._leftTree;
        }
        set {
            self._leftTree = newValue;
        }
    }

    var rightTree : BinarySearchTree<T>? {
        get {
            return self._rightTree;
        }
        set {
            self._rightTree = newValue;
        }
    }

    public func add(newValue : T) -> BinarySearchTree<T> {
        var navigator : BinarySearchTree<T>?;
        var subtree : BinarySearchTree<T>?;

        var done : Bool?;

        done = false;
        navigator = self;

        while (!done!) {
            if (newValue < navigator?.value) {
                subtree = navigator?.leftTree;
                if (subtree != nil) {
                    navigator = subtree;
                } else {
                    let newNode = BinarySearchTree<T>(value: newValue);
                    navigator!.leftTree = newNode;
                    done = true;
                }
            } else if (newValue > navigator?.value) {
                subtree = navigator?.rightTree;
                if (subtree != nil) {
                    navigator = subtree;
                } else {
                    let newNode = BinarySearchTree<T>(value: newValue);
                    navigator?.rightTree = newNode;
                    done = true;
                }
            } else {
                done = true;
            }
        }
        return self;
    }
} /* cut remove/search methods */

And this is the test code I wrote for the test run

let count : Int32 = 262144;
let base : Int32 = 65536;
let target : Int32 = count + 1;

var info = mach_timebase_info(numer:0, denom:0);
var timebase = mach_timebase_info(&info);
let numer = UInt64(info.numer);
let denom = UInt64(info.denom);
let norm = UInt64(numer/denom);

let check1 = (mach_absolute_time() * norm);

var root = BinarySearchTree<Int32>(value:base);

for var loop in 0 ... count-1 {
    if (loop % 1000 == 0) {
        print(loop);
    }
    root = root.add(loop);
}

let check2 = (mach_absolute_time() * norm);
print("Creation phase microseconds: [" + String((check2 - check1) / 1000) + "]");

I tried searching for the specific swift release/retain issue with no luck, I am not sure how to proceed. Thanks everyone

like image 805
IridiumFX Avatar asked Apr 17 '16 12:04

IridiumFX


1 Answers

The issue, as you note is the retain/release (though it isn't really, retain/release is insignificant next to the power of ummm... we'll get there at the end). This isn't really related to allocations. You're not allocating extra objects, you're just retaining them briefly and then releasing them. I'll start with Kenneth's code, which optimizes out a lot of performance problems in the original, but still has this issue. (I'm not considering the recursive code because it crashes in your current use case. It does dodge some redundant retains, though.)

It is worth saying that Kenneth's code is good and is generally the way you should do things (as you'll see even more as we go along).

First note: when you mentioned -Ofast, that's for ObjC, not Swift. The flag for Swift is just -O. You also want -whole-module-optimization, but that doesn't really help anything here.

One more little thing, then we're get to it. Mark classes final any time you can. This makes sure there's no dynamic dispatch. This doesn't matter very much here compared the retain/release, but hey, take the easy stuff.

Does 30% off sound good?

OK, now a big one, and it's a trick. I find that I can knock out about 30% of the time (from ~6min to ~4min for the full import) by rewriting this:

guard let subtree = navigator.leftTree else {
    navigator.leftTree = BinarySearchTree<T>(value: newValue)
    break
}
navigator = subtree
continue

As this:

let subtree = navigator.leftTree
if subtree == nil {
    navigator.leftTree = BinarySearchTree(value: newValue)
    break
}
navigator = subtree!
continue

This is a something to be very careful with. It turns out to be faster in this case, but that may not be as fast across other inputs. That may not be as fast across changes to the optimizer (the SIL generation is a bit weird, and I suspect may actually be a mistake because it seems to double-retain navigator in the second case, but only after the if has succeeded). But it does seem to currently be faster. (EDIT: The Swift team was surprised by this finding, and there is now a bug opened against it. Do not expect this to work in the future.)

How about 85% How's that sound?

But like you said, couldn't we avoid all this with structs? But it'd be insanely expensive to copy the whole tree every time we touch it. Of course we could dramatically improve that with copy-on-write like Array uses. But COW is pretty complicated. If only there were a way to reuse the existing stuff. What if we use Array?

private struct Node<Element: Comparable> {
    let value: Element
    var leftIndex = -1 // Ugly, but ~25% faster than using Int? in my tests
    var rightIndex = -1
    init(_ value: Element) { self.value = value }
}

// This works exactly the same if you make it a `final class`. Your choice.
public struct BinarySearchTree<Element: Comparable> {
    private var storage: [Node<Element>] = []

    init(value: Element) { storage.append(Node(value)) }

    public mutating func add(newValue: Element) {
        if storage.isEmpty {
            storage.append(Node(newValue))
        }

        var index = 0

        while (true) {
            let node = storage[index]
            if (newValue < node.value) {
                if node.leftIndex < 0 {
                    storage.append(Node(newValue))
                    storage[index].leftIndex = storage.count - 1 // Don't use node here; remember value types!
                    break
                }
                index = node.leftIndex
                continue
            } else if (newValue > node.value) {
                if node.rightIndex < 0 {
                    storage.append(Node(newValue))
                    storage[index].rightIndex = storage.count - 1
                    break
                }
                index = node.rightIndex
                continue
            } else {
                break
            }
        }
    }
}

This takes ~45s to run on my system. Of course this makes delete a bit more complicated. You'd either have to accept "leaked" memory (possibly with periodic repacking), or you'd need to maintain a freelist. But a freelist wouldn't be too hard to add.

Let's try 99.97% improvement with no changes to add().

And of course it's important to remember that this is a nearly pathological case for BST. Even if you were often handed data in order, you'd be better off applying a shuffle prior to inserting it, even including the cost of the shuffle. For example, using shuffleInPlace (and counting its time), inserting exactly the same values:

var values = Array(0 ... count - 1)
values.shuffleInPlace()
for (loop, value) in values.enumerate() {
    if (loop % 1000 == 0) {
        print(loop)
    }
    root.add(value)
}

This takes us from 45s to about 0.1s. (Kenneth's version and my "!" version are about about 0.2s under this metric; I'd probably use Kenneth's solution, with final added. Even your original code, which has a lot of inefficiencies that Kenneth fixed, only takes 0.5s with this change. Remember, the Kenneth-optimized version with in-order add was 6 minutes on my system.)

It's worth it to shuffle before inserting. If you get things over time, it could be worth it to batch them up and shuffle them before inserting. If the tree changes over time, it'd be worth checking if it's gotten too deep and rebuilding it periodically. Keeping the tree depth reasonable overwhelms every other optimization. Clever ways of working around Swift memory management couldn't touch this one change.

Fix the algorithm. Everything else is peanuts in comparison.

like image 142
Rob Napier Avatar answered Oct 22 '22 21:10

Rob Napier