
Sort a Julia 1.1 matrix by one of its columns that contains strings

As the title suggests, I need to sort the rows of a certain matrix by one of its columns, preferably in place if at all possible. Said column contains Strings (the array being of type Array{Union{Float64,String}}), and ideally the rows should end up in alphabetical order, determined by this column. The line

sorted_rows = sort!(data, by = i -> data[i,2]),

where data is my matrix, produces the error ERROR: LoadError: UndefKeywordError: keyword argument dims not assigned. Specifying which part of the matrix I want sorted and adding the parameter dims=2 (which I assume is the dimension I want to sort along), namely

sorted_rows = sort!(data[2:end-1,:], by = i -> data[i,2],dims=2)

simply changes the error message to ERROR: LoadError: ArgumentError: invalid index: 01 Suurin yhteinen tekijä ja pienin yhteinen jaettava of type String. So the compiler is complaining about a string being an invalid index.

Any ideas on how this type of sorting could be done? I should say that in this case the string in the column can be expected to start with a number, but I wouldn't mind finding a solution that works in the general case.

I'm using Julia 1.1.

SeSodesa asked Mar 26 '19 17:03


2 Answers

You want sortslices, not sort: the latter just sorts all columns independently, whereas the former rearranges whole slices. Also, the by function doesn't take an index; it takes the value that is about to be compared (here, a whole row) and allows you to transform it in some way. Thus:

julia> using Random
       data = Union{Float64, String}[randn(100) [randstring(10) for _ in 1:100]]
100×2 Array{Union{Float64, String},2}:
  0.211015  "6VPQbWU5f9"
 -0.292298  "HgvHLkufqI"
  1.74231   "zTCu1U5Vdl"
  0.195822  "O3j43sbhKV"
  ⋮
 -0.369007  "VzFH2OpWfU"
 -1.30459   "6C68G64AWg"
 -1.02434   "rldaQ3e0GE"
  1.61653   "vjvn1SX3FW"

julia> sortslices(data, by=x->x[2], dims=1)
100×2 Array{Union{Float64, String},2}:
  0.229143  "0syMQ7AFgQ"
 -0.642065  "0wUew61bI5"
  1.16888   "12PUn4V4gL"
 -0.266574  "1Z2ONSBP04"
  ⋮
  1.85761   "y2DDANcFCe"
  1.53337   "yZju1uQqMM"
  1.74231   "zTCu1U5Vdl"
  0.974607  "zdiU0sVOZt"
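
For a small deterministic check, the same call can be sketched on a hand-written matrix. The values below are made up for illustration, with strings that start with a number as in the question; only `sortslices` from Base is used.

```julia
# Hypothetical data: column 1 holds Float64 values, column 2 holds Strings.
data = Union{Float64, String}[[3.0, 1.0, 2.0] ["03 gamma", "01 alpha", "02 beta"]]

# With dims=1, `by` receives each row as a slice, so row[2] is that row's string.
sorted = sortslices(data, dims=1, by=row -> row[2])

# sortslices returns a new array; `data` itself keeps its original row order.
```

Because the comparison is plain string ordering, numeric prefixes sort as expected only when they are zero-padded to the same width.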

Unfortunately we don't have an in-place sortslices! yet, but you can easily construct a sorted view with sortperm. This probably won't be as fast to use, but if you need the in-place-ness for semantic reasons it'll do just the trick.

julia> p = sortperm(data[:,2]);

julia> @view data[p, :]
100×2 view(::Array{Union{Float64, String},2}, [26, 45, 90, 87, 6, 96, 82, 75, 12, 27  …  53, 69, 100, 93, 36, 37, 39, 8, 3, 61], :) with eltype Union{Float64, String}:
  0.229143  "0syMQ7AFgQ"
 -0.642065  "0wUew61bI5"
  1.16888   "12PUn4V4gL"
 -0.266574  "1Z2ONSBP04"
  ⋮
  1.85761   "y2DDANcFCe"
  1.53337   "yZju1uQqMM"
  1.74231   "zTCu1U5Vdl"
  0.974607  "zdiU0sVOZt"
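
If the goal is to overwrite the existing array rather than keep a view, one option is to copy the permuted rows back into it. This is only a sketch with hypothetical sample values: the right-hand side still allocates a temporary copy, so it is in place in the semantic sense only, not an allocation-free sort.

```julia
# Hypothetical data with the same element type as in the question.
data = Union{Float64, String}[[3.0, 1.0, 2.0] ["03 gamma", "01 alpha", "02 beta"]]

p = sortperm(data[:, 2])   # row order that sorts column 2
data .= data[p, :]         # data[p, :] is a fresh copy, written back into `data`
```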

(If you want the in-place-ness for performance reasons, I'd recommend using a DataFrame or similar structure that holds its columns as independent homogeneous vectors. A Union{Float64, String} array will be slower than two separate well-typed vectors, and sort!ing a DataFrame works on whole rows like you want.)
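
The idea of keeping each column as its own well-typed vector can be sketched with Base alone, without committing to a particular table package. The vector names and values below are hypothetical; `sortperm` and `permute!` are Base functions, and this is an illustration of the layout rather than a replacement for a DataFrame.

```julia
# Hypothetical columns, each with a concrete element type.
scores = [3.0, 1.0, 2.0]
labels = ["03 gamma", "01 alpha", "02 beta"]

p = sortperm(labels)   # permutation that sorts the string column
permute!(scores, p)    # reorder both columns in place with the same permutation
permute!(labels, p)
```

The permutation has to be computed before either vector is reordered, since it refers to the original positions.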

mbauman answered Sep 28 '22 08:09


You may want to look at SortingLab.jl's fast string sort functions.

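]add SortingLab
using SortingLab
# data[:,2] has eltype Union{Float64,String}; convert it to a Vector{String} first
idx = fsortperm(String.(data[:,2]))
# index rows with idx and keep all columns (data[idx] would index linearly)
new_data = data[idx, :]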

xiaodai answered Sep 28 '22 07:09