
How to handle missing in boolean context in Julia?

I'm trying to create a categorical variable based on ranges of values from another (numerical) column. However, the code doesn't work when the numerical column contains missing values.

Here is a reproducible example:

using RDatasets;
using DataFrames;
using Pipe;
using FreqTables;

df = dataset("datasets","iris")
#lowercase columns just for convenience
@pipe df |> rename!(_, [lowercase(k) for k in names(df)]);

#without this line, the code works fine
@pipe df |> allowmissing!(_, :sepallength) |> replace!(_.sepallength, 4.9 => missing);

df[:size] = @. ifelse(df[:sepallength]<=4.7, "small", missing)
df[:size] = @. ifelse((df[:sepallength]>4.7) & (df[:sepallength]<=4.9), "avg", df[:size])
df[:size] = @. ifelse((df[:sepallength]>4.9) & (df[:sepallength]<=5), "large", df[:size])
df[:size] = @. ifelse(df[:sepallength]>5, "huge", df[:size])

println(@pipe df |> freqtable(_, :size))

Output:

TypeError: non-boolean (Missing) used in boolean context

I would like to ignore the missing cases in the numerical variable, but I cannot just drop the missing values because that would drop other important information in my dataset. Moreover, if I drop only the missing values in sepallength, the column df[:size] would have a different length than the original data frame.

Lucas asked Jan 24 '23 15:01


2 Answers

I think Bogumił's approach is correct and probably best for most situations, but another option I like is to define my own comparison operators that handle missing values by returning false whenever one is encountered. Julia's Unicode support makes this quite pleasant, in my opinion:

julia> ==ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x == y;

julia> >=ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x >= y;

julia> <=ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x <= y;

julia> <ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x < y;

julia> >ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x > y;

julia> x = rand([missing; 1:10], 50)

julia> x .> 10
50-element Array{Union{Missing, Bool},1}
...

julia> x .>ₘ 10
50-element BitArray{1}
...

There are of course downsides to defining such elementary operators in your own code, especially with Unicode names: the code becomes harder for other people to read (and potentially even to display correctly!). So I probably wouldn't advocate this as the standard approach, or as something to use in library code. I do think, though, that it makes life easier for exploratory work.
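
For readers worried about the Unicode issue, the same idea can be sketched with plain ASCII function names. The names gt_m and le_m and the sample vector are hypothetical, chosen only for illustration; the logic mirrors the operators defined above.

```julia
# Hypothetical ASCII-named equivalents of the missing-safe operators:
# return false whenever either argument is missing.
gt_m(x, y) = (ismissing(x) || ismissing(y)) ? false : x > y
le_m(x, y) = (ismissing(x) || ismissing(y)) ? false : x <= y

# Made-up sample data standing in for the sepallength column.
x = [4.6, 4.8, missing, 5.2]

# Both comparisons return plain Bool values, so the mask holds no missings.
is_avg = gt_m.(x, 4.7) .& le_m.(x, 4.9)

# ifelse now receives only true/false and no longer throws a TypeError.
label = ifelse.(is_avg, "avg", missing)
```

Because the mask is a pure Bool array, it can be passed to ifelse directly; rows where x is missing simply keep the fallback value.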

Nils Gudat answered Feb 06 '23 08:02


Use the coalesce function like this:

julia> x = [1,2,3,missing,5,6,7]
7-element Array{Union{Missing, Int64},1}:
 1
 2
 3
  missing
 5
 6
 7

julia> @. ifelse(coalesce(x < 4.7, false), "small", missing)
7-element Array{Union{Missing, String},1}:
 "small"
 "small"
 "small"
 missing
 missing
 missing
 missing
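
The same pattern should extend to all four ranges from the question. The following is a minimal sketch on a plain vector rather than the data frame; the helper name istrue, the variable label, and the sample values are assumptions made for illustration.

```julia
# Made-up sample data standing in for the sepallength column.
x = [4.6, 4.8, 4.9, missing, 5.0, 5.4]

# Hypothetical helper: treat a missing comparison result as false.
istrue(c) = coalesce(c, false)

# Each step relabels the rows in its range and keeps earlier labels elsewhere.
label = @. ifelse(istrue(x <= 4.7), "small", missing)
label = @. ifelse(istrue((x > 4.7) & (x <= 4.9)), "avg", label)
label = @. ifelse(istrue((x > 4.9) & (x <= 5)), "large", label)
label = @. ifelse(istrue(x > 5), "huge", label)
```

Rows where x is missing never satisfy any condition, so they stay missing and the result has the same length as the input.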

As a side note, do not write df[:size] (this syntax has been deprecated for over 2 years now and will soon error); instead write df.size or df."size" to access a column of the data frame. The df."size" form is for column names that contain characters such as spaces, e.g. df."my fancy column!".

Bogumił Kamiński answered Feb 06 '23 06:02
