I am not sure how to handle NA within Julia DataFrames.
For example, with the following DataFrame:
> import DataFrames
> a = DataFrames.@data([1, 2, 3, 4, 5]);
> b = DataFrames.@data([3, 4, 5, 6, NA]);
> ndf = DataFrames.DataFrame(a=a, b=b)
I can successfully execute the following operation on column :a
> ndf[ndf[:a] .== 4, :]
but if I try the same operation on :b, I get the error NAException("cannot index an array with a DataArray containing NA values"):
> ndf[ndf[:b] .== 4, :]
NAException("cannot index an array with a DataArray containing NA values")
while loading In[108], in expression starting on line 1
in to_index at /Users/abisen/.julia/v0.3/DataArrays/src/indexing.jl:85
in getindex at /Users/abisen/.julia/v0.3/DataArrays/src/indexing.jl:210
in getindex at /Users/abisen/.julia/v0.3/DataFrames/src/dataframe/dataframe.jl:268
This is because of the NA value in column :b.
My question is: how should DataFrames with NA values typically be handled? I can understand that a > or < operation against NA would be undefined, but == should work (no?).
What's your desired behavior here? If you want to do selections like this, you can make the condition (not NA) AND (equal to 4). Wherever the first test is false, the combined condition is false, so the NA produced by the second test never reaches the indexing step.
using DataFrames
a = @data([1, 2, 3, 4, 5]);
b = @data([3, 4, 5, 6, NA]);
ndf = DataFrame(a=a, b=b)
ndf[(!isna(ndf[:b]))&(ndf[:b].==4),:]
In some cases you might just want to drop all rows with NAs in certain columns:
ndf = ndf[!isna(ndf[:b]),:]
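To illustrate why == does not simply return false for an unknown value, here is a minimal sketch in plain Julia. It assumes a current Julia release, where the built-in `missing` value plays the role that NA plays in this thread. It uses only Base functions and no DataFrames calls, and the variable names are just for illustration.

```julia
# Assumption: Julia 1.x, where `missing` stands in for the NA used in this thread.
b = [3, 4, 5, 6, missing]

# Comparing with == propagates the unknown value instead of returning false.
raw = b .== 4                  # the last element is `missing`, not `false`

# Option 1: turn unknown comparison results into false before indexing.
mask1 = coalesce.(raw, false)

# Option 2: isequal never returns `missing`, so the mask is already Boolean.
mask2 = isequal.(b, 4)

println(b[mask1])
println(b[mask2])
```

Either mask can be used for indexing because it no longer contains an unknown entry. Which one reads better is a matter of taste.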
Following up on a related question I asked before: you can change this NA behavior directly in the module's source code if you want. In the file indexing.jl there is a function named Base.to_index(A::DataArray), beginning at line 75, where you can alter the code to set NAs in the boolean array to false. For example, you can do the following:
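# Modified version: NA entries in the index are treated as false instead of raising
function Base.to_index(A::DataArray)
    A[A.na] = false  # added line: overwrites the NA entries of the index array in place
    # Original check, kept as is; it should no longer fire once the NAs are overwritten
    any(A.na) && throw(NAException("cannot index an array with a DataArray containing NA values"))
    Base.to_index(A.data)
end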
Filtering out NAs with isna() makes the source code less readable and, in large expressions, costs performance:
@timeit ndf[(!isna(ndf[:b])) & (ndf[:b] .== 4),:] #3.68 µs per loop
@timeit ndf[ndf[:b] .== 4, :] #2.32 µs per loop
## 71x179 2D Array
@timeit dm[(!isna(dm)) & (dm .< 3)] = 1 #14.55 µs per loop
@timeit dm[dm .< 3] = 1 #754.79 ns per loop
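If the goal is the "NA counts as false" behavior without editing the package source, one possible alternative is a small helper applied to the mask at the call site. The sketch below is only an illustration under stated assumptions: `na_as_false` is a hypothetical name, the matrix values are made up, and the code targets a current Julia release with `missing` rather than the DataArrays version discussed here. It makes no claim about timings.

```julia
# Hypothetical helper (not part of DataFrames or DataArrays):
# converts a mask that may contain `missing` into a plain Bool mask.
na_as_false(mask) = coalesce.(mask, false)

# Small stand-in for the matrix `dm` above; the values are made up.
dm = Union{Missing, Int}[1 5 missing; 2 missing 7]

# Assign 1 wherever the comparison is definitely true; missing entries stay untouched.
dm[na_as_false(dm .< 3)] .= 1

println(dm)
```

Keeping the conversion in a named helper leaves the library's error in place for every other call site, so an unexpected NA in an index is still reported there.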