Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to subset Julia DataFrame by condition, where column has missing values

This seems like something that should be almost dead simple, yet I cannot accomplish it.

I have a dataframe df in julia, where one column is of type Array{Union{Missing, Int64},1}.

The values in that column are: [missing, 1, 2].

I would simply like to subset the dataframe df to just see those rows that correspond to a condition, such as where the column is equal to 2.

What I have tried --> result:

df[df[:col].==2] --> MethodError: no method matching getindex

df[df[:col].==2, :] --> ArgumentError: invalid row index of type Bool

df[df[:col].==2, :col] --> BoundsError: attempt to access String (note that doing just df[!, :col] results in: 1339-element Array{Union{Missing, Int64},1}: [...eliding output...], with my favorite warning so far in julia: Warning: getindex(df::DataFrame, col_ind::ColumnIndex) is deprecated, use df[!, col_ind] instead. Having just used that would seem to exempt me from the warning, but whatever.)

This cannot be as hard as it seems.

Just as FYI, I can get what I want through using Query and making a multi-line sql query just to subset data, which seems...burdensome.

like image 404
Savage Henry Avatar asked Jun 16 '20 01:06

Savage Henry


1 Answers

How to do row subsetting

There are two ways to solve your problem:

  1. use isequal instead of ==, as == implements 3-valued logic., so just writing one of will work:
df[isequal.(df.col,2), :] # new data frame
filter(:col => isequal(2), df) # new data frame
filter!(:col => isequal(2), df) # update old data frame in place
  1. if you want to use == use coalesce on top of it, e.g.:
df[coalesce.(df.col .== 2, false), :] # new data frame

There is nothing special about it related to DataFrames.jl. Indexing works the same way in Julia Base:

julia> x = [1, 2, missing]
3-element Array{Union{Missing, Int64},1}:
 1
 2
  missing

julia> x[x .== 2]
ERROR: ArgumentError: unable to check bounds for indices of type Missing

julia> x[isequal.(x, 2)]
1-element Array{Union{Missing, Int64},1}:
 2

(in general you can expect that, where possible, DataFrames.jl will work consistently with Julia Base; except for some corner cases where it is not possible - the major differences come from the fact that DataFrame has heterogeneous column element types while Matrix in Julia Base has homogeneous element type)

How to do indexing

DataFrame is a two-dimensional object. It has rows and columns. In Julia, normally, df[...] notation is used to access object via locations in its dimensions. Therefore df[:col] is not a valid way to index into a DataFrame. You are trying to use one indexing dimension, while specifying both row and column indices is required. You are getting a warning, because you are using an invalid indexing approach (in the next release of DataFrames.jl this warning will be gone and you will just get an error).

Actually your example df[df[:col].==2] shows why we disallow single-dimensional indexing. In df[:col] you try to use a single dimensional index to subset columns, but in outer df[df[:col].==2] you want to subset rows using a single dimensional index.

The easiest way to get a column from a data frame is df.col or df."col" (the second way is usually used if you have characters like spaces in the column name). This way you can access column :col without copying it. An equivalent way to write this selection using indexing is df[!, :col]. If you would want to copy the column write df[:, :col].

A side note - more advanced indexing

Indeed in Julia Base, if a is an array (of whatever dimension) then a[i] is a valid index if i is an integer or CartesianIndex. Doing df[i], where i is an integer is not allowed for DataFrame as it was judged that it would be too confusing for users if we wanted to follow the convention from Julia Base (as it is related to storage mode of arrays which is not the same as for DataFrame). You are though allowed to write df[i] when i is CartesianIndex (as this is unambiguous). I guess this is not something you are looking for.

All the rules what is allowed for indexing a DataFrame are described in detail here. Also during JuliaCon 2020 there is going to be a workshop during which the design of indexing in DataFrames.jl will be discussed in detail (how it works, why it works this way, and how it is implemented internally).

like image 108
Bogumił Kamiński Avatar answered Jan 03 '23 01:01

Bogumił Kamiński