Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the difference between `UnitRange` and `Array`?

Tags:

julia

I have two versions of code that seem to do the same thing:

sum = 0
for x in 1:100
    sum += x
end
sum = 0
for x in collect(1:100)
    sum += x
end

Is there a practical difference between the two approaches?

like image 732
David Varela Avatar asked Sep 18 '19 21:09

David Varela


1 Answers

In Julia, 1:100 returns a particular struct called UnitRange that looks like this:

julia> dump(1:100)
UnitRange{Int64}
  start: Int64 1
  stop: Int64 100

This is a very compact struct to represent ranges with step 1 and arbitrary (finite) size. UnitRange is subtype of AbstractRange, a type to represent ranges with arbitrary step, subtype of AbstractVector.

The instances of UnitRange dynamically compute their elements whenever the you use getindex (or the syntactic sugar vector[index]). For example, with @less (1:100)[3] you can see this method:

function getindex(v::UnitRange{T}, i::Integer) where {T<:OverflowSafe}
    @_inline_meta
    val = v.start + (i - 1)
    @boundscheck _in_unit_range(v, val, i) || throw_boundserror(v, i)
    val % T
end

This is returning the i-th element of the vector by adding i - 1 to the first element (start) of the range. Some functions have optimised methods with UnitRange, or more generally with AbstractRange. For instance, with @less sum(1:100) you can see the following

function sum(r::AbstractRange{<:Real})
    l = length(r)
    # note that a little care is required to avoid overflow in l*(l-1)/2
    return l * first(r) + (iseven(l) ? (step(r) * (l-1)) * (l>>1)
                                     : (step(r) * l) * ((l-1)>>1))
end

This method uses the formula for the sum of an arithmetic progression, which is extremely efficient as it's evaluated in a time independent of the size of the vector.

On the other hand, collect(1:100) returns a plain Vector with one hundred elements 1, 2, 3, ..., 100. The main difference with UnitRange (or other types of AbstractRange) is that getindex(vector::Vector, i) (or vector[i], with vector::Vector) doesn't do any computation but simply accesses the i-th element of the vector. The downside of a Vector over a UnitRange is that generally speaking there aren't efficient methods when working with them as the elements of this container are completely arbitrary, while UnitRange represents a set of numbers with peculiar properties (sorted, constant step, etc...).

If you compare the performance of methods for which UnitRange has super-efficient implementations, this type will win hands down (note the use of interpolation of variables with $(...) when using macros from BenchmarkTools):

julia> using BenchmarkTools

julia> @btime sum($(1:1000_000))
  0.012 ns (0 allocations: 0 bytes)
500000500000

julia> @btime sum($(collect(1:1000_000)))
  229.979 μs (0 allocations: 0 bytes)
500000500000

Remember that UnitRange comes with the cost of dynamically computing the elements every time you access them with getindex. Consider for example this function:

function test(vec)
    sum = zero(eltype(vec))
    for idx in eachindex(vec)
        sum += vec[idx]
    end
    return sum
end

Let's benchmark it with a UnitRange and a plain Vector:

julia> @btime test($(1:1000_000))
  812.673 μs (0 allocations: 0 bytes)
500000500000

julia> @btime test($(collect(1:1000_000)))
  522.828 μs (0 allocations: 0 bytes)
500000500000

In this case the function calling the plain array is faster than the one with a UnitRange because it doesn't have to dynamically compute 1 million elements.

Of course, in these toy examples it'd be more sensible to iterate over all elements of vec rather than its indices, but in real world cases a situation like these may be more sensible. This last example, however, shows that a UnitRange is not necessarily more efficient than a plain array, especially if you need to dynamically compute all of its elements. UnitRanges are more efficient when you can take advantage of specialised methods (like sum) for which the operation can be performed in constant time.

As a file remark, note that if you originally have a UnitRange it's not necessarily a good idea to convert it to a plain Vector to get good performance, especially if you're going to use it only once or very few times, as the conversion to Vector involves itself the dynamic computation of all elements of the range and the allocation of the necessary memory:

julia> @btime collect($(1:1000_000));
  422.435 μs (2 allocations: 7.63 MiB)

julia> @btime test(collect($(1:1000_000)))
  882.866 μs (2 allocations: 7.63 MiB)
500000500000
like image 166
giordano Avatar answered Dec 13 '22 05:12

giordano