 

Usage and convention differences between missing, nothing, undef, and NaN in Julia

Tags:

julia

I am looking for some guidance on when to use missing, nothing, undef, and NaN in Julia.

For example, all seem like reasonable choices for pre-allocating an array or returning from a try/catch.

Nathan Boyer asked May 21 '20 14:05


1 Answer

TLDR:

  • If you're working in statistics, chances are that you want missing to signal the absence of a particular data point in a collection.

  • If you want to define an array of floating-point numbers, but initialize individual elements later, you might want to use undef for performance reasons (to avoid spending time setting elements to a value which will get overwritten afterwards):

    Vector{Float64}(undef, n)
    

    In the same situation, but following an approach less oriented towards performance and more towards safety, you can also initialize all elements to NaN in order to take advantage of the propagating behavior of NaN to help identify bugs that could happen if you forget to set some value in the array:

    fill(NaN, n)
    
  • You'll probably encounter nothing in some parts of Julia's API, where it signals cases in which no meaningful value can be computed. But it is generally not used in arrays otherwise containing numeric data (which seems to be your use case here).
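As a rough sketch of the two pre-allocation styles above (the vector names and the deliberately incomplete loop are only for illustration):

```julia
n = 5

# Fast path: uninitialized storage; every element is written before being read.
squares = Vector{Float64}(undef, n)
for i in 1:n
    squares[i] = i^2
end

# Safer path: start from NaN so that a forgotten element is easy to spot.
partial = fill(NaN, n)
for i in 1:n-1              # deliberately "forgets" the last element
    partial[i] = i^2
end

any(isnan, squares)         # false
any(isnan, partial)         # true: the forgotten element is still NaN
sum(partial)                # NaN: the mistake propagates instead of hiding
```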


Here is my take on the differences between these options:




missing is used to represent missing values in a statistical sense, i.e. values that theoretically exist, but that you don't know. missing is similar in spirit (and in behavior, in most cases) to NA in R. A defining feature of missing values is that you can use them in computations:

julia> x = 1       # x has a known value: 1
1

julia> y = missing # y has a value, but it is unknown
missing

julia> z = x * y   # no error: z has a value, that just happens to be unknown
missing            # (as a consequence of not knowing the value of y)
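In practice you will often want to test for, skip, or replace missing values rather than let them propagate. A small sketch using only functions from Base (the vector data is made up for illustration):

```julia
data = [1.0, missing, 3.0]

sum(data)                   # missing: the unknown value propagates
ismissing(sum(data))        # true
sum(skipmissing(data))      # 4.0: ignore the unknown entries
coalesce.(data, 0.0)        # [1.0, 0.0, 3.0]: replace them with a default

# Comparisons follow three-valued logic: the answer is itself unknown.
missing == 1                # missing, not false
ismissing(missing == 1)     # true
```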

One important characteristic of missing is that it has its own specific type: Missing. This means in particular that arrays containing missing values among other numeric values are not homogeneous in type:

julia> [1, missing, 3]
3-element Array{Union{Missing, Int64},1}: # not Array{Int64, 1}
 1
 missing
 3

Note that, although the Julia compiler has become very good at handling arrays whose element type is such a small union, storing elements of different types carries an inherent performance cost: the type of a given element cannot be known in advance and has to be checked at run time.
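To make the type distinction concrete, here is a short sketch of how one might inspect the element type, pre-allocate a vector of unknowns, and narrow the type again once the missing entries are gone (the variable names are illustrative, and the Int64 results assume a 64-bit machine):

```julia
v = [1, missing, 3]
eltype(v)                        # Union{Missing, Int64}
nonmissingtype(eltype(v))        # Int64

# Pre-allocate a vector whose entries are all unknown, then fill in what is known.
w = Vector{Union{Missing, Float64}}(missing, 4)
w[1] = 0.5
count(ismissing, w)              # 3

# Dropping the missing entries yields a plain, homogeneous vector again.
clean = collect(skipmissing(v))
eltype(clean)                    # Int64
```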




nothing also has its own type: Nothing. In contrast to missing, it tends to be used for things that have no value at all, which is why computing with nothing does not make sense and raises an error:

julia> 3*nothing
ERROR: MethodError: no method matching *(::Int64, ::Nothing)

nothing is primarily used as the return value of functions that don't return anything, either because they only have side-effects, or because they could not compute any meaningful result:

julia> @show println("OK")           # Only side effects
OK
println("OK") = nothing

julia> @show findfirst('a', "Hello") # No meaningful result
findfirst('a', "Hello") = nothing
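Since the question mentions returning from a try/catch, here is one possible sketch of that pattern. parse_or_nothing is a hypothetical helper name, not part of Julia; isnothing, something and tryparse are Base functions:

```julia
# Hypothetical helper: returns the parsed integer, or `nothing` when the
# string is not a valid integer.
function parse_or_nothing(s::AbstractString)
    try
        return parse(Int, s)
    catch err
        err isa ArgumentError || rethrow()   # only swallow parse failures
        return nothing
    end
end

idx = findfirst('a', "Hello")
isnothing(idx)                          # true
idx === nothing                         # true

# `something` returns its first argument that is not `nothing`.
something(parse_or_nothing("abc"), 0)   # 0: fall back to a default
something(parse_or_nothing("42"), 0)    # 42

# Base already offers this behavior for parsing:
tryparse(Int, "abc")                    # nothing
```

Callers then have to handle the nothing case explicitly, which is the point: unlike missing or NaN, it will not silently flow through later arithmetic.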

Another notable use of nothing is in function arguments or object fields for which a value is not always provided. This would typically be represented in the type system as a Union{MeaningfulType, Nothing}. For example, with the following definition of a binary tree structure, a leaf (which, by definition, is a node that has no children) would be represented as a node whose children are nothing:

struct TreeNode
  child1 :: Union{TreeNode, Nothing}
  child2 :: Union{TreeNode, Nothing}
end

leaf = TreeNode(nothing, nothing)
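Code walking such a tree can dispatch on Nothing to handle absent children. A minimal sketch, repeating the struct so that it runs on its own (count_leaves is a hypothetical function introduced here for illustration):

```julia
struct TreeNode
    child1 :: Union{TreeNode, Nothing}
    child2 :: Union{TreeNode, Nothing}
end

# An absent child contributes no leaves.
count_leaves(::Nothing) = 0

# A node without children is a leaf; otherwise recurse into both children.
function count_leaves(node::TreeNode)
    if node.child1 === nothing && node.child2 === nothing
        return 1
    end
    return count_leaves(node.child1) + count_leaves(node.child2)
end

leaf = TreeNode(nothing, nothing)
root = TreeNode(TreeNode(leaf, nothing), leaf)

count_leaves(leaf)   # 1
count_leaves(root)   # 2
```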




Unlike the previous two, NaN does not have its own specific type: NaN is merely a specific value of the Float64 type (and NaN32 similarly exists for Float32). As you probably know, these values normally appear as the result of undefined operations (such as 0/0), and have a very special meaning in floating-point arithmetic, which makes them propagate (in more or less the same way as missing values). But apart from that arithmetic behavior, these are normal floating-point values. In particular, a vector of floating-point values may contain NaNs without it affecting its type:

julia> [1., NaN, 2.]
3-element Array{Float64,1}: # Note how this differs from the example with missing above
 1.0
 NaN
 2.0
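A few checks that illustrate how NaN behaves as an ordinary Float64 value while still propagating through arithmetic (a sketch; x is just an example vector):

```julia
x = [1.0, NaN, 2.0]

typeof(NaN)                 # Float64
0.0 / 0.0                   # NaN
sum(x)                      # NaN: propagates through arithmetic

# Unlike missing, comparisons with NaN return an ordinary Bool.
NaN == NaN                  # false
isnan(NaN)                  # true
isequal(NaN, NaN)           # true

sum(filter(!isnan, x))      # 3.0: drop the NaNs explicitly when needed
```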




undef is very different from everything that has been mentioned so far. It is not really a value (at least not in the sense of a number having a value), but rather a "flag" that one can pass to array constructors to tell Julia not to initialize the values in the array (generally for performance considerations). In the following example, the array elements will not be set to any specific value but, since there is no such thing as a number without any value in Julia, elements will have arbitrary values (coming from whatever happens to be in memory where the vector gets allocated).

julia> Vector{Float64}(undef, 3)
3-element Array{Float64,1}:
 6.94567437726575e-310
 6.94569509953624e-310
 6.94567437549977e-310
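The "flag" nature of undef can be seen from its type: it is the only instance of UndefInitializer, which array constructors accept as an argument. A short sketch (the choice to fill v with zeros is arbitrary):

```julia
typeof(undef)                   # UndefInitializer
undef isa UndefInitializer      # true

v = Vector{Float64}(undef, 3)   # contents are arbitrary until assigned
v .= 0.0                        # overwrite every element before reading any
v                               # [0.0, 0.0, 0.0]
```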

When elements are of a more complex type (in technical terms: a non-isbits type) and a distinction can be made between initialized and uninitialized elements, Julia denotes the latter with #undef:

julia> mutable struct Foo end
julia> Vector{Foo}(undef, 3)
3-element Array{Foo,1}:
 #undef
 #undef
 #undef
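
For such arrays, isassigned tells you whether a slot has been set, and reading an unset slot throws an error instead of returning garbage. A small sketch reusing the Foo type from above:

```julia
mutable struct Foo end

v = Vector{Foo}(undef, 3)
isassigned(v, 1)        # false
v[1] = Foo()
isassigned(v, 1)        # true
isassigned(v, 2)        # false

# Reading an unassigned slot throws an UndefRefError.
err = try
    v[2]
catch e
    e
end
err isa UndefRefError   # true
```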
François Févotte answered Sep 28 '22 06:09