I need to decide whether to use STM in a Clojure system I am involved with, which needs several GB of data stored in a single STM ref.
I would like to hear from anyone who has experience using Clojure STM with large indexed datasets.
I've been using Clojure for some fairly large-scale data processing tasks (definitely gigabytes of data, typically lots of largish Java arrays stored inside various Clojure constructs/STM refs).
As long as everything fits in available memory, you shouldn't have a problem with extremely large amounts of data in a single ref. The ref itself applies only a small fixed amount of STM overhead that is independent of the size of whatever is contained within it.
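To illustrate the point about fixed overhead, here is a minimal sketch (the names `dataset` and `add-record!` are illustrative, not from the original post): the ref only holds a pointer to the current value, so the transaction machinery costs the same whether the map inside holds ten entries or ten million.

```clojure
;; A single ref holding an indexed dataset (here a map of id -> record).
;; STM overhead is per-transaction, not per-byte of contained data.
(def dataset (ref {}))

(defn add-record!
  "Transactionally add a record under the given id."
  [id record]
  (dosync
    (alter dataset assoc id record)))

(add-record! 1 {:name "a"})
@dataset
;; => {1 {:name "a"}}
```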
A nice extra bonus comes from the structural sharing that is built into Clojure's standard data structures (maps, vectors etc.) - you can take a complete copy of a 10GB data structure, change one element anywhere in the structure, and be guaranteed that both data structures will together only require a fraction more than 10GB. This is very helpful, particularly if you consider that due to STM/concurrency you will potentially have several different versions of the data being created at any one time.
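A small sketch of that structural sharing in action: "copying" a large persistent vector and changing one element does not duplicate the underlying data, because the two versions share almost all of their internal tree nodes.

```clojure
;; A persistent vector with a million elements.
(def big (vec (range 1000000)))

;; assoc returns a new vector; only the path to the changed element
;; is rebuilt (roughly O(log32 n) nodes), the rest is shared with big.
(def big2 (assoc big 0 :changed))

(first big)   ;; => 0        original is untouched
(first big2)  ;; => :changed new version sees the change
```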
The performance isn't going to be any worse or any better than STM involving a single ref with a small dataset. Performance is more hindered by the number of updates to a dataset than the actual size of the dataset.
If you have one writer and many readers, performance will still be quite good, because readers never block writers. However, if you have many concurrent writers, performance will suffer, since conflicting write transactions must be retried.
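A sketch of why readers are cheap (the `counter` ref here is just an illustration): dereferencing a ref outside a transaction is effectively a pointer read of its current value, while writes coordinate through `dosync` and retry on conflict.

```clojure
(def counter (ref 0))

;; Readers never block or retry: deref outside a transaction simply
;; reads the ref's current committed value.
(defn read-value [] @counter)

;; Writers go through a transaction; if another transaction commits a
;; conflicting change first, this body is retried against the new value.
(defn increment! [] (dosync (alter counter inc)))

(increment!)
(read-value)
;; => 1
```

When updates are commutative (as `inc` is here), replacing `alter` with `commute` lets Clojure reorder the updates instead of retrying, which reduces contention among many writers.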
Perhaps more information would help us help you more.