I tried to speed up an R function by porting it to Julia, but to my surprise Julia was slower. The function sequentially updates a list of vectors (array of arrays in Julia). Beforehand the index of the list element to be updated is unknown and the length of the new vector is unknown. I have written a test function that demonstrates the behavior.
Julia
function MyTest(n)
  a = [[0.0] for i in 1:n]
    for i in 1:n
      a[i] = cumsum(ones(i))
    end  
  a
end
R
MyTest <- function(n){
  a <- as.list(rep(0, n))
  for (i in 1:n) 
    a[[i]] <- cumsum(rep(1, i))
  a
}
By setting n to 5000, 10000 and 20000, typical computing times are (median of 21 tests):
I used a windows-laptop with 64 bit Julia-1.3.1 and 64 bit R-3.6.1.
Both these functions use 64 bit floating-point types. My real problem involves integers and then R is even more favorable. But integer comparison isn’t fair since R uses 32 bit integers and Julia 64 bit. Is it something I can do to speed up Julia or is really Julia much slower than R in this case?
I don't quite see how you get your test results. Assuming you want 32 bit integers, as you said, then we have
julia> function mytest(n)
           a = Vector{Vector{Int32}}(undef, n)
           for i in 1:n
               a[i] = cumsum(ones(i))
           end
           return a
       end
mytest (generic function with 1 method)
julia> @btime mytest(20000);
  1.108 s (111810 allocations: 3.73 GiB)
When we only get rid of those allocations, we already get down to the following:
julia> function mytest(n)
           a = Vector{Vector{Int32}}(undef, n)
           @inbounds for i in 1:n
               a[i] = collect(UnitRange{Int32}(1, i))
           end
           return a
       end
mytest (generic function with 1 method)
julia> @btime mytest(20000);
  115.702 ms (35906 allocations: 765.40 MiB)
Further devectorization does not even help:
julia> function mytest(n)
           a = Vector{Vector{Int32}}(undef, n)
           @inbounds for i in 1:n
               v = Vector{Int32}(undef, i)
               v[1] = 1
               @inbounds for j = 2:i
                   v[j] = v[j-1] + 1
               end
               a[i] = v
           end
           return a
       end
mytest (generic function with 1 method)
julia> @btime mytest(20000);
  188.856 ms (35906 allocations: 765.40 MiB)
But with a couple of threads (I assume the inner arrays are independent), we get 2x speed-up again:
julia> Threads.nthreads()
4
julia> function mytest(n)
           a = Vector{Vector{Int32}}(undef, n)
           Threads.@threads for i in 1:n
               v = Vector{Int32}(undef, i)
               v[1] = 1
               @inbounds for j = 2:i
                   v[j] = v[j-1] + 1
               end
               a[i] = v
           end
           return a
       end
mytest (generic function with 1 method)
julia> @btime mytest(20000);
  99.718 ms (35891 allocations: 763.13 MiB)
But this is only about as fast as the second variant above.
That is, for the specific case of cumsum.  Other inner functions are slower, of course, but can be equally threaded, and optimized in the same ways, with possibly different results.
(This is on Julia 1.2, 12 GiB RAM, and an older i7.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With