In Haskell, I have a container like:
data Container a = Container { length :: Int, buffer :: Unboxed.Vector (Int,a) }
This container is a flattened tree. Its accessor (!)
performs a binary (log(N)
) search through the vector in order to find the right bucket where index
is stored.
(!) :: Container a -> Int -> a
container ! index = ... binary search ...
Since consecutive accesses are likely to be in the same bucket, this could be optimized in the following way:
if `index` is on the the last accessed bucket, skip the search
The tricky point is the last accessed bucket
part. In JavaScript, I'd just impurely modify a hidden variable on the container object.
function read(index,object){
var lastBucket = object.__lastBucket;
// if the last bucket contains index, no need to search
if (contains(object, lastBucket, index))
var bucket = lastBucket;
// if it doesn't
else {
// then we search the bucket
var bucket = searchBucket(index,object);
// And impurely annotate it on the container, so the
// next time we access it we could skip the search.
container.__lastBucket = bucket;
}
return object.buffer[bucket].value;
}
Since this is just an optimization and the result is the same independent of the branch taken, I believe it doesn't break referential transparency. How is it possible, in Haskell, to impurely modify an state associated with a runtime value?
~
I have thought in 2 possible solutions.
A global, mutable hashmap linking pointers to the lastBucket
value, and use unsafePerformIO to write on it. But I'd need a way to get the runtime pointer of an object, or at least an unique id of some sort (how?).
Add an extra field to Container
, lastBucket :: Int
, and somehow impurely modify it within (!)
, and consider that field internal (because it obviously break referential transparency).
First, an impure function is a function that contains one or more side effects. In the snippet above, updateMyName() is an impure function because it contains code ( myNames ) that mutates an external state — which gives updateMyName() some side effects.
An impure function is a function that mutates variables/state/data outside of it's lexical scope, thus deeming it “impure” for this reason. There are many ways to write JavaScript, and thinking in terms of impure/pure functions we can write code that is much easier to reason with.
An impure function is a function that contains one or more side effects. It mutates data outside of its lexical scope and does not predictably produce the same output for the same input.
An impure function is a function that contains one or more side effects. A pure function is a function without any side effects.
Using solution (1), I managed to get the following design. First, I added a __lastAccessedBucket :: IORef Int
field to my datatype, as suggested by @Xicò:
data Container a = Container {
length :: Int,
buffer :: V.Vector (Int,a),
__lastAccessedBucket :: IORef Int }
Then, I had to update the functions that create a new Container
in order to create a new IORef using unsafePerformIO
:
fromList :: [a] -> Container a
fromList list = unsafePerformIO $ do
ref <- newIORef 0
return $ Container (L.length list) buffer ref
where buffer = V.fromList (prepare list)
Finally, I created two new functions, findBucketWithHint
, a pure function which searches the bucket of an index with guess (i.e., the bucket where you think it might be), and the unsafeFindBucket
function, which replaces the pure findBucket
when performance is needed, by always using the last accessed bucket as the hint:
unsafeFindBucket :: Int -> Container a -> Int
unsafeFindBucket findIdx container = unsafePerformIO $ do
let lastBucketRef = __lastAccessedBucket contianer
lastBucket <- readIORef lastBucketRef
let newBucket = findBucketWithHint lastBucket findIdx container
writeIORef lastBucketRef newBucket
return $ newBucket
With this, unsafeFindBucket
is technically a pure function with the same API of the original findBucket
function, but is an order of magnitude faster in some benchmarks. I have no idea how safe this is and where it could cause bugs. Threads are certainly a concern.
(This is more an extended comment than an answer.)
First I'd suggest to check if this isn't a case of premature optimization. After all, O(log n) ins't that bad.
If this part is indeed performance-critical, your intention is definitely valid. The usual warning for unsafePerformIO
is "use it only if you know what you're doing", which you obviously do, and it can help to make things pure and fast at the same time.
Be sure that you follow all the precautions in the docs, in particular setting the proper compiler flags (you might want to use the OPTIONS_GHC
pragma).
Also make sure that the IO
operation is thread safe. The easiest way to ensure that is to use IORef
together with atomicModifyIORef
.
The disadvantage of an internal mutable state is that the performance of the cache will deteriorate if it's accessed from multiple threads, if they lookup different elements.
One remedy would be to explicitly thread the updated state instead of using the internal mutable state. This is obviously what you want to avoid, but if your program is using monads, you could just add another monadic layer that'd internally keep the state for you and expose the lookup operation as a monadic action.
Finally, you could consider using splay trees instead of the array. You'd still have (amortized) O(log n) complexity, but their big advantage is that by design they move frequently accessed elements near the top. So if you'll be accessing a subset of elements of size k, they'll be soon moved to the top, so the lookup operations will be just O(log k) (constant for a single, repeatedly accessed element). Again, they update the structure on lookups, but you could use the same approach with unsafePerformIO
and atomic updates of IORef
to keep the outer interface pure.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With