In julia I want to generate a rank within a group given by one column in the dataframe and the ranking based on a second column. I want to add the rank and number of rows in each group into new columns in the dataframe. The following appears to do what I require when grouping by :id and ranking by :b (is this correct?), but I have been unable to give the rank column a name. Is this sufficiently performant, or should I simply do it in a loop, after sorting on :id and :b? In my actual data/application I will have millions of rows and several hundred thousand unique groups.
x = DataFrame(id=["a","b","a","c","c","a","c","a","a","c"],
b=[2,5,7,8,3,9,1,10,4,6],
c=["one","two","three","four","five","six","seven","eight","nine","ten"])
y = combine(groupby(x,:id), sdf -> sort(sdf,:b), s->1:nrow(s), nrow => :n)
this outputs, which is correct, except that I wish to give 'x1' a column name like 'rank':
Row │ id b c x1 n
│ String Int64 String Int64 Int64
─────┼─────────────────────────────────────
1 │ a 2 one 1 5
2 │ a 4 nine 2 5
3 │ a 7 three 3 5
4 │ a 9 six 4 5
5 │ a 10 eight 5 5
6 │ b 5 two 1 1
7 │ c 1 seven 1 4
8 │ c 3 five 2 4
9 │ c 6 ten 3 4
10 │ c 8 four 4 4
This is the way to give rank column a name:
julia> combine(groupby(x,:id), sdf -> sort(sdf,:b), s->(rank=1:nrow(s),), nrow => :n)
10×5 DataFrame
Row │ id b c rank n
│ String Int64 String Int64 Int64
─────┼─────────────────────────────────────
1 │ a 2 one 1 5
2 │ a 4 nine 2 5
3 │ a 7 three 3 5
4 │ a 9 six 4 5
5 │ a 10 eight 5 5
6 │ b 5 two 1 1
7 │ c 1 seven 1 4
8 │ c 3 five 2 4
9 │ c 6 ten 3 4
10 │ c 8 four 4 4
or (which seems a bit cleaner for me)
julia> combine(groupby(x,:id), sdf -> sort(sdf,:b), :id => eachindex => :rank, nrow => :n)
10×5 DataFrame
Row │ id b c rank n
│ String Int64 String Int64 Int64
─────┼─────────────────────────────────────
1 │ a 2 one 1 5
2 │ a 4 nine 2 5
3 │ a 7 three 3 5
4 │ a 9 six 4 5
5 │ a 10 eight 5 5
6 │ b 5 two 1 1
7 │ c 1 seven 1 4
8 │ c 3 five 2 4
9 │ c 6 ten 3 4
10 │ c 8 four 4 4
(there is a PR open to add an option to add row-number more easily --- please comment there if you would find this useful)
Regarding your question about performance the following might be faster; the trade-off is between sorting a big data frame once vs small data frames many times:
transform!(groupby(sort(x, [:id, :b]), :id), :id => eachindex => :rank, nrow => :n)
It would be interesting to get a feedback on the comparison when you run this on your full data set.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With