Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

julia DataFrame select rows based values of one column belonging to a set

Using a DataFrame in Julia, I want to select rows on the basis of the value taken in a column.

With the following example

using DataFrames, DataFramesMeta
DT = DataFrame(ID = [1, 1, 2,2,3,3, 4,4], x1 = rand(8))

I want to extract the rows with ID taking the values 1 and 4. For the moment, I came out with that solution.

@where(DT, findall(x -> (x==4 || x==1), DT.ID))

When using only two values, it is manageable.

However, I want to make it applicable to a case with many rows and a large set of value for the ID to be selected. Therefore, this solution is unrealistic if I need to write down all the value to be selected

Any fancier solution to make this selection generic?

Damien

like image 286
djourd1 Avatar asked Oct 03 '19 13:10

djourd1


People also ask

How do you filter data in Julia?

We can filter rows by using filter(source => f::Function, df) . Note how this function is very similar to the function filter(f::Function, V::Vector) from Julia Base module.

How do I change the column name in Julia DataFrame?

Additionally, in your example, you should use select! in order to modify the column names in place, or alternatively do 'df = select(df, "col1" => "Id", "col2" => "Name")` as select always return a new DataFrame .


1 Answers

Here is a way to do it using standard DataFrames.jl indexing and using @where from DataFramesMeta.jl:

julia> DT
8×2 DataFrame
│ Row │ ID    │ x1        │
│     │ Int64 │ Float64   │
├─────┼───────┼───────────┤
│ 1   │ 1     │ 0.433397  │
│ 2   │ 1     │ 0.963775  │
│ 3   │ 2     │ 0.365919  │
│ 4   │ 2     │ 0.325169  │
│ 5   │ 3     │ 0.0495252 │
│ 6   │ 3     │ 0.637568  │
│ 7   │ 4     │ 0.391051  │
│ 8   │ 4     │ 0.436209  │

julia> DT[in([1,4]).(DT.ID), :]
4×2 DataFrame
│ Row │ ID    │ x1       │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.433397 │
│ 2   │ 1     │ 0.963775 │
│ 3   │ 4     │ 0.391051 │
│ 4   │ 4     │ 0.436209 │

julia> @where(DT, in([1,4]).(:ID))
4×2 DataFrame
│ Row │ ID    │ x1       │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.433397 │
│ 2   │ 1     │ 0.963775 │
│ 3   │ 4     │ 0.391051 │
│ 4   │ 4     │ 0.436209 │

In non performance critical code you can also use filter, which is - at least for me a bit simpler to digest (but it has a drawback, that it is slower than the methods discussed above):

julia> filter(row -> row.ID in [1,4], DT)
4×2 DataFrame
│ Row │ ID    │ x1       │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.433397 │
│ 2   │ 1     │ 0.963775 │
│ 3   │ 4     │ 0.391051 │
│ 4   │ 4     │ 0.436209 │

Note that in the approach you mention in your question you could omit DT in front of ID like this:

julia> @where(DT, findall(x -> (x==4 || x==1), :ID))
4×2 DataFrame
│ Row │ ID    │ x1       │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.433397 │
│ 2   │ 1     │ 0.963775 │
│ 3   │ 4     │ 0.391051 │
│ 4   │ 4     │ 0.436209 │

(this is a beauty of DataFramesMeta.jl that it knows the context of the DataFrame you want to refer to)

like image 188
Bogumił Kamiński Avatar answered Oct 16 '22 12:10

Bogumił Kamiński