
Summary table of unique value combinations in DataFrames.jl

I often want to find the unique combinations of some grouping variables in a data table. With R + dplyr, my normal workflow is group_by(data, var1, var2, var3) %>% summarise(), which returns a new table with the columns var1, var2, var3 and one row for each unique combination of values found in data.

What's the idiomatic way to do this in DataFrames.jl?

Dave Kleinschmidt asked Dec 13 '22 07:12


2 Answers

In DataFrames.jl, a DataFrame is a collection of rows. So the right mental model here is to first select only the columns you care about, then take the unique rows of that table:

select(data, [:var1, :var2, :var3]) |> unique!

Or, if you hate the pipe/love extra parens:

unique!(select(data, [:var1, :var2, :var3]))

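unique! is recommended here because select copies the underlying columns by default, so mutating the result in place leaves data untouched. Alternatively, you could use indexing with ! or a view, neither of which copies the columns. These require the non-mutating unique, which returns a new data frame instead of modifying the shared column vectors, so as not to corrupt the original data frame: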

unique(data[!, [:var1, :var2, :var3]])
unique(view(data, :, [:var1, :var2, :var3]))
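To make the copy semantics concrete, here is a minimal sketch on a made-up table. The contents of data, the extra value column, and the names cols, combos, combos_indexed and combos_view are all assumptions for illustration and are not part of the original question.

```julia
using DataFrames

# Hypothetical example data: three grouping columns plus one unrelated column.
data = DataFrame(var1 = [1, 1, 2, 2, 1],
                 var2 = ["a", "a", "b", "b", "a"],
                 var3 = [true, true, false, true, true],
                 value = [10.0, 20.0, 30.0, 40.0, 50.0])

cols = [:var1, :var2, :var3]

# select copies the columns, so the in-place unique! only touches the copy.
combos = select(data, cols) |> unique!

# Indexing with ! and view share columns with data, so use the non-mutating unique.
combos_indexed = unique(data[!, cols])
combos_view = unique(view(data, :, cols))

# data itself should still have all of its original rows.
println(nrow(data), " rows in data, ", nrow(combos), " unique combinations")
```

All three results should contain the same rows; which form to use is mostly a matter of taste and of whether you want to avoid the intermediate copy made by select.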
Dave Kleinschmidt answered Mar 04 '23 03:03


Alternatively, you can write:

keys(groupby(data, [:var1, :var2, :var3]))

to get a vector of the unique grouping keys. If you want, you can then collect them into a DataFrame by writing:

groupby(data, [:var1, :var2, :var3]) |> keys |> DataFrame
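As a rough illustration of the grouping approach, the sketch below runs it on a made-up table. The data and the names gdf, group_keys and combos are assumptions for illustration only. The order of the groups is not relied on here.

```julia
using DataFrames

# Hypothetical example data: three grouping columns plus one unrelated column.
data = DataFrame(var1 = [1, 1, 2, 2, 1],
                 var2 = ["a", "a", "b", "b", "a"],
                 var3 = [true, true, false, true, true],
                 value = [10.0, 20.0, 30.0, 40.0, 50.0])

gdf = groupby(data, [:var1, :var2, :var3])

# One key per group; each key exposes the grouping values by column name.
group_keys = keys(gdf)
for k in group_keys
    println(k.var1, " ", k.var2, " ", k.var3)
end

# Collect the keys into a table with one row per unique combination.
combos = DataFrame(group_keys)
```

This route is convenient when you need the grouped data frame anyway, since the keys come from the grouping you have already computed.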
Bogumił Kamiński answered Mar 04 '23 03:03