How to select only a subset of dataframe columns in julia

Tags: julia

I have a DataFrame with many columns, say column1, column2, ..., column100. How do I select only a subset of the columns? For example, "not column1" should return all of the columns column2, ..., column100.

data[[colnames(data) .!= "column1"]])

doesn't seem to work.

I don't want to mutate the DataFrame. I just want to select all the columns that don't have a particular name, as in my example.

Vishnu asked Sep 14 '15 06:09


3 Answers

EDIT 2/7/2021: as people still seem to find this on Google, I'll note at the top that current DataFrames (1.0+) supports both Not() selection, provided by InvertedIndices.jl, and strings as column names, including regex selection with the r"" string macro. Examples:

julia> df = DataFrame(a1 = rand(2), a2 = rand(2), x1 = rand(2), x2 = rand(2), y = rand(["a", "b"], 2))
2×5 DataFrame
 Row │ a1        a2        x1        x2        y      
     │ Float64   Float64   Float64   Float64   String 
─────┼────────────────────────────────────────────────
   1 │ 0.784704  0.963761  0.124937  0.37532   a
   2 │ 0.814647  0.986194  0.236149  0.468216  a

julia> df[!, r"2"]
2×2 DataFrame
 Row │ a2        x2       
     │ Float64   Float64  
─────┼────────────────────
   1 │ 0.963761  0.37532
   2 │ 0.986194  0.468216

julia> df[!, Not(r"2")]
2×3 DataFrame
 Row │ a1        x1        y      
     │ Float64   Float64   String 
─────┼────────────────────────────
   1 │ 0.784704  0.124937  a
   2 │ 0.814647  0.236149  a

Finally, the names function has a method which takes a type as its second argument, which is handy for subsetting DataFrames by the element type of each column:


julia> df[!, names(df, String)]
2×1 DataFrame
 Row │ y      
     │ String 
─────┼────────
   1 │ a
   2 │ a

In addition to indexing with square brackets, there's also the select function (and its mutating equivalent select!), whose second argument accepts essentially the same column selectors as the column index in []-indexing:

julia> select(df, Not(r"a"))
2×3 DataFrame
 Row │ x1        x2        y      
     │ Float64   Float64   String 
─────┼────────────────────────────
   1 │ 0.124937  0.37532   a
   2 │ 0.236149  0.468216  a

Original answer below


As @Reza Afzalan said, what you're trying to do returns an array of strings, while column names in DataFrames are symbols.

Since Julia doesn't have conditional list comprehensions, the nicest thing you could do is probably:

data[:, filter(x -> x != :column1, names(data))]

This will give you the data set with column 1 removed (without mutating it). You could extend this to checking against lists of names as well:

data[:, filter(x -> !(x in [:column1,:column2]), names(data))]

UPDATE: As Ian says below, for this use case the Not syntax is now the best way to go.

More generally, conditional list comprehensions are also available by now, so you could do:

data[:, [x for x in names(data) if x != :column1]]
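
A note for readers on current DataFrames releases: names(data) now returns a vector of strings rather than symbols, so the comparisons against :column1 in the snippets above would not drop anything. Below is a minimal sketch of the same idea, assuming DataFrames 1.x; the three-column table and the variable names are made up for illustration.

```julia
using DataFrames

# Hypothetical toy data standing in for the asker's 100-column table.
data = DataFrame(column1 = 1:3, column2 = 4:6, column3 = 7:9)

# names(data) yields strings in DataFrames 1.x, so compare against a string.
kept = [x for x in names(data) if x != "column1"]
without_col1 = data[:, kept]

# propertynames(data) yields symbols, matching the older symbol-based style.
kept_syms = filter(x -> x != :column1, propertynames(data))
without_col1_syms = data[:, kept_syms]
```

Both results are new DataFrames; data itself keeps all three columns.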

Nils Gudat answered Oct 21 '22 23:10


As of DataFrames 0.19, it seems that you can now do

select(data, Not(:column1))

to select all but the column column1. To select all except for multiple columns, use an array in the inverted index:

select(data, Not([:column1, :column2]))
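
To check that this meets the "don't mutate" requirement, here is a small sketch, assuming a recent DataFrames version where Not is exported; the toy table and the result names are hypothetical.

```julia
using DataFrames

# Hypothetical toy data; the column names mirror the question.
data = DataFrame(column1 = [1, 2], column2 = [3, 4], column3 = [5, 6])

# Drop a single column; select returns a new DataFrame.
no_first = select(data, Not(:column1))

# Drop several columns at once by wrapping them in a vector.
only_third = select(data, Not([:column1, :column2]))

# Print the column names of the original and of both results.
@show names(data) names(no_first) names(only_third)
```

The original data still has all three columns afterwards; select! would be the variant that modifies it in place.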

Ian Fiske answered Oct 21 '22 23:10


To select several columns by name:

 df[[:col1, :col2]]

or, for other versions of the DataFrames library, I use:

select(df, [:col1, :col2])
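
On recent DataFrames versions, bracket indexing needs a row selector as well as the column selector. The sketch below assumes DataFrames 1.x and uses a hypothetical three-column df to show the row-and-column forms alongside select.

```julia
using DataFrames

# Hypothetical toy data for illustration.
df = DataFrame(col1 = [1, 2], col2 = [3, 4], col3 = [5, 6])

# `:` as the row selector copies the chosen columns into a new DataFrame.
picked_copy = df[:, [:col1, :col2]]

# `!` as the row selector builds a new DataFrame without copying the columns.
picked_nocopy = df[!, [:col1, :col2]]

# select takes the same column selector and copies by default.
picked_select = select(df, [:col1, :col2])
```

All three leave df unchanged; the `:` and select forms are the safer choice if the result will be modified later.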

Timothée HENRY answered Oct 21 '22 23:10