Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Clojure - Incrementing numbers in a list efficiently and concurrently

Short version: What's the proper way to store a list of several hundred numbers in Clojure, where each number is being incremented millions of times (potentially across several threads)?

Long version: The program starts with an empty vector where each value is initialized to 0:

[0 0 0 0 0 0 0 0 0 ...]

It then reads a multi-million line file, line-by-line. After performing some arbitrary computation on a line, the program increments some of the values in the vector. After the first line the vector may look like:

[1 1 1 2 0 1 0 1 1 ...]

After the second line:

[2 2 3 2 2 1 0 2 2 ...]

After ~5000 lines it may look something like:

[5000 4998 5008 5002 4225 5098 5002 5043 ...]

Since Clojure's data structures are immutable simply using assoc to increment a value in the vector seems incredibly wasteful, because the entirety of the vector will be copied for each increment.

What is the correct way to do this kind of concurrent data aggregation without spending all my CPU time copying immutable data structures? Should I have a vector where each element is something like a ref or atom, with all threads incrementing those shared values? Or, is there some kind of thread-level data structure that can store the counts, and then the final step is to consolidate the counts from each thread?

This probably won't be I/O bound on a single thread so I'm guessing I will split line processing across several threads. There is no limit to the length of the vector (it could be several thousand elements long) but more than likely it be around 100 elements long.

like image 911
FrostedMint Avatar asked Dec 01 '22 19:12

FrostedMint


1 Answers

Clojure's vector is persistent data structure. When updating an element within vector, it doesn't copy the entire elements, and takes essentially constant time, which means O(log32 n).

But it seems that you're updating almost every element within the vector each iteration. Perhaps you want to refer Transient Data Structures.

like image 127
ntalbs Avatar answered Dec 04 '22 02:12

ntalbs