Short version: What's the proper way to store a list of several hundred numbers in Clojure, where each number is being incremented millions of times (potentially across several threads)?
Long version: The program starts with an empty vector where each value is initialized to 0:
[0 0 0 0 0 0 0 0 0 ...]
It then reads a multi-million line file, line-by-line. After performing some arbitrary computation on a line, the program increments some of the values in the vector. After the first line the vector may look like:
[1 1 1 2 0 1 0 1 1 ...]
After the second line:
[2 2 3 2 2 1 0 2 2 ...]
After ~5000 lines it may look something like:
[5000 4998 5008 5002 4225 5098 5002 5043 ...]
Since Clojure's data structures are immutable simply using assoc
to increment a value in the vector seems incredibly wasteful, because the entirety of the vector will be copied for each increment.
What is the correct way to do this kind of concurrent data aggregation without spending all my CPU time copying immutable data structures? Should I have a vector where each element is something like a ref or atom, with all threads incrementing those shared values? Or, is there some kind of thread-level data structure that can store the counts, and then the final step is to consolidate the counts from each thread?
This probably won't be I/O bound on a single thread so I'm guessing I will split line processing across several threads. There is no limit to the length of the vector (it could be several thousand elements long) but more than likely it be around 100 elements long.
Clojure's vector
is persistent data structure. When updating an element within vector, it doesn't copy the entire elements, and takes essentially constant time, which means O(log32 n).
But it seems that you're updating almost every element within the vector each iteration. Perhaps you want to refer Transient Data Structures.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With