Julia DataFrames.jl - filter data with NA's (NAException)

Tags:

julia

I am not sure how to handle NA within Julia DataFrames.

For example with the following DataFrame:

> import DataFrames
> a = DataFrames.@data([1, 2, 3, 4, 5]);
> b = DataFrames.@data([3, 4, 5, 6, NA]);
> ndf = DataFrames.DataFrame(a=a, b=b)

I can successfully execute the following operation on column :a

> ndf[ndf[:a] .== 4, :]

but if I try the same operation on :b I get an error NAException("cannot index an array with a DataArray containing NA values").

> ndf[ndf[:b] .== 4, :]

NAException("cannot index an array with a DataArray containing NA values")
while loading In[108], in expression starting on line 1

in to_index at /Users/abisen/.julia/v0.3/DataArrays/src/indexing.jl:85
in getindex at /Users/abisen/.julia/v0.3/DataArrays/src/indexing.jl:210
in getindex at /Users/abisen/.julia/v0.3/DataFrames/src/dataframe/dataframe.jl:268

This is because of the NA value in column :b.

My question is: how should DataFrames containing NA typically be handled? I can understand that a > or < comparison against NA would be undefined, but == should work (no?).

asked Jul 09 '15 23:07 by datafig


2 Answers

What's your desired behavior here? If you want to do selections like this, you can make the condition (not NA) AND (equal to 4). Where the first test is false, the combined condition is false whatever the second test returns, so no NA reaches the index.

using DataFrames
a = @data([1, 2, 3, 4, 5]);
b = @data([3, 4, 5, 6, NA]);
ndf = DataFrame(a=a, b=b)
ndf[(!isna(ndf[:b])) & (ndf[:b] .== 4), :]

In some cases you might just want to drop all rows with NAs in certain columns:

ndf = ndf[!isna(ndf[:b]), :]
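For readers on a current Julia version, here is a minimal sketch of the same two ideas. It assumes Julia 1.x, where `missing` has replaced `NA`, and it uses plain vectors instead of a DataFrame so that it runs without any packages. The variable names are only for illustration.

```julia
# Assumption: Julia 1.x, where `missing` plays the role of NA.
a = [1, 2, 3, 4, 5]
b = [3, 4, 5, 6, missing]

# The comparison propagates the unknown value: the last element is `missing`,
# so this mask cannot be used for indexing directly.
raw_mask = b .== 4

# Option 1: replace missing entries in the mask with false.
mask = coalesce.(raw_mask, false)

# Option 2: isequal never returns missing, so the mask is purely Boolean.
mask2 = isequal.(b, 4)

selected = a[mask]               # elements of a where b equals 4
complete = a[.!ismissing.(b)]    # elements of a where b is present
```

Both options give the same mask here. `coalesce` keeps the `==` comparison visible in the expression, while `isequal` is shorter.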
answered Oct 26 '22 15:10 by ARM


Following up on a related question I asked earlier: you can change this NA behavior directly in the module's source code if you want. In the file indexing.jl there is a function named Base.to_index(A::DataArray), beginning at line 75, where you can alter the code to set NAs in the boolean array to false. For example, you can do the following:

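# Patched version: NAs in the boolean index are set to false before indexing
function Base.to_index(A::DataArray)
    A[A.na] = false  # added line; note that this modifies A in place
    # The assignment above is meant to leave no NA, so this check should no longer fire
    any(A.na) && throw(NAException("cannot index an array with a DataArray containing NA values"))
    Base.to_index(A.data)
end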

Filtering out NAs with isna() makes the source code less readable and, in large expressions, costs performance:

@timeit ndf[(!isna(ndf[:b])) & (ndf[:b] .== 4),:]  #3.68 µs per loop
@timeit ndf[ndf[:b] .== 4, :]  #2.32 µs per loop

## 71x179 2D Array
@timeit dm[(!isna(dm)) & (dm .< 3)] = 1  #14.55 µs per loop  
@timeit dm[dm .< 3] = 1  #754.79 ns per loop 
answered Oct 26 '22 14:10 by Manuel