I'm trying to create a categorical variable based on ranges of values from another (numerical) column. However, the code doesn't work when there are missing values in the numerical column.
Here is a replicable example:
using RDatasets;
using DataFrames;
using Pipe;
using FreqTables;
df = dataset("datasets","iris")
#lowercase columns just for convenience
@pipe df |> rename!(_, [lowercase(k) for k in names(df)]);
#without this line, the code works fine
@pipe df |> allowmissing!(_, :sepallength) |> replace!(_.sepallength, 4.9 => missing);
df[:size] = @. ifelse(df[:sepallength]<=4.7, "small", missing)
df[:size] = @. ifelse((df[:sepallength]>4.7) & (df[:sepallength]<=4.9), "avg", df[:size])
df[:size] = @. ifelse((df[:sepallength]>4.9) & (df[:sepallength]<=5), "large", df[:size])
df[:size] = @. ifelse(df[:sepallength]>5, "huge", df[:size])
println(@pipe df |> freqtable(_, :size))
Output:
TypeError: non-boolean (Missing) used in boolean context
I would like to ignore the missing cases in the numerical variable, but I cannot just drop the missing values because this would drop other important information in my dataset. Moreover, if I drop just the missing values in sepallength, the column df[:size] would have a different length than the original dataframe.
I think Bogumil's approach is correct and probably best for most situations, but one other option that I like to use is to define my own comparison operators that can deal with missings by returning false if a missing is encountered. Using the unicode capabilities of Julia makes this quite pleasant in my opinion:
julia> ==ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x == y;
julia> >=ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x >= y;
julia> <=ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x <= y;
julia> <ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x < y;
julia> >ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x > y;
julia> x = rand([missing; 1:10], 50)
julia> x .> 10
50-element Array{Union{Missing, Bool},1}
...
julia> x .>ₘ 10
50-element BitArray{1}
...
There are of course downsides to defining such an elementary operator in your own code, particularly with Unicode: your code becomes harder for other people to read (and potentially even to display correctly!). So I probably wouldn't advocate this as the standard approach, or as something to be used in library code. I do think, though, that for exploratory work it makes life easier.
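If the Unicode operator names are a concern, the same idea can be sketched with plain ASCII function names. The names gt_m and le_m below are made up for this illustration and are not part of any package; the input vector is the one from the coalesce answer.

```julia
# Hypothetical ASCII-named helpers (names invented for this sketch).
# Each returns false whenever either argument is missing.
gt_m(x, y) = !ismissing(x) && !ismissing(y) && x > y
le_m(x, y) = !ismissing(x) && !ismissing(y) && x <= y

x = [1, 2, 3, missing, 5, 6, 7]

# Broadcasting yields a plain Bool mask, never a Union{Missing, Bool} one.
mask = gt_m.(x, 2) .& le_m.(x, 5)
```

Because the mask contains only Bool values, it can be passed straight to ifelse or used for indexing without hitting the TypeError from the question.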
Use the coalesce function like this:
julia> x = [1,2,3,missing,5,6,7]
7-element Array{Union{Missing, Int64},1}:
1
2
3
missing
5
6
7
julia> @. ifelse(coalesce(x < 4.7, false), "small", missing)
7-element Array{Union{Missing, String},1}:
"small"
"small"
"small"
missing
missing
missing
missing
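Applied to the four ranges from the question, the same pattern might look like the sketch below. It uses a plain vector as a stand-in for the sepallength column, with made-up values, so that it runs without DataFrames; the variable names are assumptions for illustration only.

```julia
# Stand-in for the sepallength column; the values are invented for this sketch.
sepallength = [4.6, missing, 4.8, 5.0, 5.4]

# coalesce turns a missing comparison result into false, so ifelse always
# receives a Bool condition. Rows with a missing input keep the label missing.
labels = @. ifelse(coalesce(sepallength <= 4.7, false), "small", missing)
labels = @. ifelse(coalesce((sepallength > 4.7) & (sepallength <= 4.9), false), "avg", labels)
labels = @. ifelse(coalesce((sepallength > 4.9) & (sepallength <= 5), false), "large", labels)
labels = @. ifelse(coalesce(sepallength > 5, false), "huge", labels)
```

The result has the same length as the input, so it can be assigned back as a new column without dropping any rows.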
As a side note, do not write df[:size] (this syntax has been deprecated for over 2 years now and will soon error) but rather df.size or df."size" to access a column of the data frame (the df."size" form is for cases when your column names contain characters such as spaces, e.g. df."my fancy column!").