Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sorting in Ruby using the Unicode collation algorithm

Ruby and Postgres are sorting slightly differently and this is causing subtle problems in my project. There's two issues: accented characters and spaces. It looks like Ruby is sorting ASCII-betically and Postgres is sorting with the proper Unicode collation algorithm.

Heroku Postgres 11.2. Database collation is en_US.UTF-8.

psql (11.3, server 11.2 (Ubuntu 11.2-1.pgdg16.04+1))
...
=> select 'quia et' > 'qui qui';
 ?column? 
----------
 f
(1 row)
=> select 'quib' > 'qüia';
 ?column? 
----------
 t
(1 row)

Ruby 2.4.4 on Heroku.

Loading production environment (Rails 5.2.2.1)
[1] pry(main)> 'quia et' > 'qui qui'
=> true
[2] pry(main)> 'quib' > 'qüia'
=> false
[3] pry(main)> ENV['LANG']
=> "en_US.UTF-8"

I can fix the handling of accented characters, but I can't get Ruby to do spaces correctly. For example, here's how they sort the same list.

Postgres: ["hic et illum", "quia et ipsa", "qui qui non"]
Ruby:     ["hic et illum", "qui qui non", "quia et ipsa"]

I've tried the icunicode gem:

array.sort_by {|s| s.unicode_sort_key}

This handles accented characters, but does not do spaces correctly.

How do I get Ruby to sort using the Unicode collation algorithm?

UPDATE A more comprehensive example is found in Unicode® Technical Standard #10. These are in the correct order.

  [
    "di Silva   Fred",
    "diSilva    Fred",
    "disílva    Fred",
    "di Silva   John",
    "diSilva    John",
    "disílva    John"
  ]
like image 444
Schwern Avatar asked Jun 07 '19 20:06

Schwern


2 Answers

I got very close using this algorithm with the icunicode gem.

require 'icunicode'

def database_sort_key(key)
  key.gsub(/\s+/,'').unicode_sort_key
end

array.sort_by { |v|
  [database_sort_key(v), v.unicode_sort_key]
}

First we sort using the unicode sort key with whitespace removed. Then if those are the same we sort by the unicode sort key of the original value.

This works around a weakness in unicode_sort_key: it doesn't consider spaces to be weak.

2.4.4 :007 > "fo p".unicode_sort_key.bytes.map { |b| b.to_s(16) }
 => ["33", "45", "4", "47", "1", "8", "1", "8"] 
2.4.4 :008 > "foo".unicode_sort_key.bytes.map { |b| b.to_s(16) }
 => ["33", "45", "45", "1", "7", "1", "7"] 

Note that the space in fo p is as important as any other character. This results in 'fo p' < 'foo' which is incorrect. We work around this by first stripping out spaces before generating the key.

2.4.4 :011 > "fo p".gsub(/\s+/, '').unicode_sort_key.bytes.map { |b| b.to_s(16) }
 => ["33", "45", "47", "1", "7", "1", "7"] 
2.4.4 :012 > "foo".gsub(/\s+/, '').unicode_sort_key.bytes.map { |b| b.to_s(16) }
 => ["33", "45", "45", "1", "7", "1", "7"] 

Now 'foo' < 'fo p' which is correct.

But because of the normalization we might have values which appear to be the same after whitespace has been stripped, fo o should be less than foo. So if the database_sort_keys are the same, we compare their plain unicode_sort_keys.

There's a few edge cases where it's wrong. foo should be less than fo o but this gets it backwards.

Here's it is as Enumerable methods.

module Enumerable
  # Just like `sort`, but tries to sort the same as the database does
  # using the proper Unicode collation algorithm. It's close.
  #
  # Differences in spacing, cases, and accents are less important than
  # character differences.
  #
  # "foo" < "fo p" o vs p is more important than the space difference
  # "Foo" < "fop" o vs p is more important than is case difference
  # "föo" < "fop" o vs p is more important than the accent difference
  #
  # It does not take a block.
  def sort_like_database(&block)
    if block_given?
      raise ArgumentError, "Does not accept a block"
    else
      # Sort by the database sort key. Two different strings can have the
      # same keys, if so sort just by its unicode sort key.
      sort_by { |v| [database_sort_key(v), v.unicode_sort_key] }
    end
  end

  # Just like `sort_by`, but it sorts like `sort_like_database`.
  def sort_by_like_database(&block)
    sort_by { |v|
      field = block.call(v)
      [database_sort_key(field), field.unicode_sort_key]
    }
  end

  # Sort by the unicode sort key after stripping out all spaces. This provides
  # a decent simulation of the Unicode collation algorithm and how it handles
  # spaces.
  private def database_sort_key(key)
    key.gsub(/\s+/,'').unicode_sort_key
  end
end
like image 149
Schwern Avatar answered Nov 08 '22 11:11

Schwern


Does your use case allow for simply delegating the sorting to Postgres, rather than trying to recreate it in Ruby?

Part of the difficultly here is that there isn't a single correct sorting method, but any variable elements can result in fairly large discrepancies in the final sort order e.g. see the section on variable weighting.

For example, a gem like twitter-cldr-rb has a fairly robust implementation of the UCA, and is backed by a comprehensive test suite - but against the non-ignorable test cases, which is different from the Postgres implementation (Postgres appears to use the shift-trimmed variant).

The sheer number of test cases means that you have no guarantee that one working solution will match the Postgres sort order in all cases. E.g. will it handle en/em dash correctly, or even emojis? You could fork and modify the twitter-cldr-rb gem, but I suspect this won't be a small undertaking!

If you need to handle values that don't exist in the database, you can ask Postgres to sort them in a lightweight fashion by using a VALUES list:

sql = "SELECT * FROM (VALUES ('de luge'),('de Luge'),('de-luge'),('de-Luge'),('de-luge'),('de-Luge'),('death'),('deluge'),('deLuge'),('demark')) AS t(term) ORDER BY term ASC"
ActiveRecord::Base.connection.execute(sql).values.flatten

It will obviously result in a round-trip to Postgres, but should be very speedy nonetheless.

like image 23
gwcodes Avatar answered Nov 08 '22 10:11

gwcodes