Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does Ruby's max function order duplicates?

I've been looking at the max method in Ruby's Enumerable mixin (v2.4.1).

It's a fairly straightforward method, but how it orders items when duplicates are present is a bit confusing.

For example:

x = [1,2,3,4,5,6,7,8,9]
x.max {|a, b| a%2 <=> b%2}
=> 1
10.times{|y| p x.max(y) {|a, b| a%2 <=> b%2}}
[]
[1]
[1, 7] # why is 7 the next element after 1?
[3, 1, 5] # why no more 7?
[7, 3, 1, 5] # 7 is now first
[9, 7, 3, 1, 5]
[9, 7, 3, 1, 5, 6]
[9, 7, 3, 1, 5, 4, 6]
[9, 7, 3, 1, 5, 2, 4, 6]
[9, 7, 5, 3, 1, 8, 6, 4, 2] # order has changed again (now seems more "natural")

How is 7 chosen as the second item? Why is it not chosen at all when three values are taken?

If you take even more numbers, the ordering is not consistent (though the items in the set are).

I have taken a glance at the source code, but it seems to be doing normal comparison; the ordering seen here isn't evident from that code.

Can anyone explain how this ordering is achieved? I know that the orderings above are all "valid", but how are they generated?

like image 480
River Avatar asked Sep 18 '17 19:09

River


1 Answers

Your example can be simplified by using max_by to yield the similar result:

10.times{|y| p x.max_by(y) {|t| t%2}}

I have spent some time with the source but can't find any hole.

After I remember seeing a publication called Switch: A Deep Embedding of Queries into Ruby (dissertation by Manuel Mayr) I found the answer.

Where on page 104 you can find the answer for max_by:

... Here, the value in the input list that assumes the maximum when evaluated by the function is returned. If more than one value yields the maximum, the choice for a result among these values is arbitrary. ...


Similarly for:
sort & sort_by from the comments @ emu.c

The result is not guaranteed to be stable. When two keys are equal, the order of the corresponding elements is unpredictable.

First, Second Edit - "we need to go deeper" => I hope you will enjoy the "ride".


The short answer:
The reason for the sorting looking like it does is combination of max_by block(causes to start sorting with max values from %2 which is 1 then it continues with 0) and qsort_r (BSD quick-sort) implemented @ruby.



The long answer: All is based on the source code of ruby 2.4.2 or currently 2.5.0 (which is being developed right now).

The quick-sort algorithm can differ based on the compiler you are using. You can be using qsort_r: GNU version, BSD version (you can check the configure.ac) for more. The visual studio is using BSD version from 2012 or later.

+Tue Sep 15 12:44:32 2015  Nobuyoshi Nakada  <[email protected]>
+
+   * util.c (ruby_qsort): use BSD-style qsort_r if available.


Thu May 12 00:18:19 2016  NAKAMURA Usaku  <[email protected]>

    * win32/Makefile.sub (HAVE_QSORT_S): use qsort_s only for Visual Studio
      2012 or later, because VS2010 seems to causes a SEGV in
test/ruby/test_enum.rb.
  1. If you have GNU qsort_r but not BSD: Only internal ruby_qsort implementation is used. Check util.c for internal implementation of the quick-sort (ruby_qsort(void* base, const size_t nel, const size_t size, cmpfunc_t *cmp, void *d)) function by Tomoyuki Kawamura. @util.h

    If HAVE_GNU_QSORT_R=1 then #define ruby_qsort qsort_r:

    #ifdef HAVE_GNU_QSORT_R
    #define ruby_qsort qsort_r
    #else    void ruby_qsort(void *, const size_t, const size_t,
        int (*)(const void *, const void *, void *), void *);
    #endif
    
  2. If the BSD-style is detected: Then the below code is used (can be found at util.c). Notice how the cmp_bsd_qsort is called before ruby_qsort. The reason? Probably standardization, stack space and maybe speed (did not test it myself - would have to create benchmark, and that is quite time consuming).

The saving stack space is indicated in the BSD qsort.c source code:

    /*
    * To save stack space we sort the smaller side of the partition first
    * using recursion and eliminate tail recursion for the larger side.
    */

The BSD branch at the ruby source code:

     #if defined HAVE_BSD_QSORT_R
    typedef int (cmpfunc_t)(const void*, const void*, void*);

    struct bsd_qsort_r_args {
        cmpfunc_t *cmp;
        void *arg;
    };

    static int
    cmp_bsd_qsort(void *d, const void *a, const void *b)
    {
        const struct bsd_qsort_r_args *args = d;
        return (*args->cmp)(a, b, args->arg);
    }

    void
    ruby_qsort(void* base, const size_t nel, const size_t size, cmpfunc_t *cmp, void *d)
    {
        struct bsd_qsort_r_args args;
        args.cmp = cmp;
        args.arg = d;
        qsort_r(base, nel, size, &args, cmp_bsd_qsort);
    }

If you are using MSYS2 to compile your ruby on Windows (no more DevKit but MSYS2 for windows installer, which I'm using most of the time ) NetBSD version of qsort_r (which is there from 02-07-2012). The latest NetBSD qsort.c (revision:1.23).

Now for the real-life examples - "we need to go even deeper"

The tests will be performed on two (windows) rubies:

  • First ruby: will be based on DevKit version 2.2.2p95 (which was released on 13 Apr 2015) and does not contain the BSD qsort implementation.

  • Second ruby: will be based on MSYS2 tool-chain version ruby 2.4.2-p198 (which was released on 15th Sep 2017) and does contain patch (see above) for BSD qsort implementation.

The code:

x=[1,2,3,4,5,6,7,8,9]
10.times{|y| p x.max_by(y) {|t| t%2}}

Ruby 2.2.2p95:

The result:
[]
[5]
[7, 1]
[3, 1, 5]
[7, 3, 1, 5]
[9, 7, 3, 1, 5]
[5, 9, 1, 3, 7, 6]
[5, 1, 9, 3, 7, 6, 4]
[5, 1, 3, 7, 9, 6, 4, 2]
[9, 1, 7, 3, 5, 4, 6, 8, 2]

Ruby 2.4.2-p198:

The result:
[]
[1]
[7, 1]
[5, 3, 1]
[5, 7, 3, 1]
[5, 9, 7, 3, 1]
[5, 1, 9, 7, 3, 6]
[5, 1, 3, 9, 7, 4, 6]
[5, 1, 3, 7, 9, 2, 6, 4]
[9, 1, 3, 7, 5, 8, 4, 6, 2]

Now for different x: x=[7,9,3,4,2,6,1,8,5]

Ruby 2.2.2p95:

The result:
[]
[1]
[9, 7]
[1, 7, 3]
[5, 1, 7, 3]
[5, 1, 3, 9, 7]
[7, 5, 9, 3, 1, 2]
[7, 9, 5, 3, 1, 2, 4]
[7, 9, 3, 1, 5, 2, 4, 8]
[5, 9, 1, 3, 7, 4, 6, 8, 2]

Ruby 2.4.2-p198:

The result:
[]
[9]
[9, 7]
[3, 1, 7]
[3, 5, 1, 7]
[7, 5, 1, 3, 9]
[7, 9, 5, 1, 3, 2]
[7, 9, 3, 5, 1, 4, 2]
[7, 9, 3, 1, 5, 8, 2, 4]
[5, 9, 3, 1, 7, 2, 4, 6, 8]

Now for the same items in the source array (qsort is unstable see below): x=[1, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Process it with the following code: 12.times{|y| p x.max_by(y) {|t| t%2}}

Ruby 2.2.2p95:

The result:
[]
[3]
[1, 1]
[9, 1, 7]
[3, 9, 1, 7]
[5, 3, 9, 1, 7]
[1, 5, 3, 9, 1, 7]
[5, 9, 3, 7, 1, 1, 1]
[1, 5, 9, 1, 7, 1, 3, 4]
[1, 1, 5, 9, 1, 7, 3, 4, 2]
[1, 1, 1, 5, 7, 3, 9, 4, 2, 8]
[9, 1, 7, 1, 5, 3, 1, 2, 6, 8, 4]

Ruby 2.4.2-p198:

The Result:
[]
[1]
[1, 1]
[7, 9, 1]
[7, 3, 9, 1]
[7, 5, 3, 9, 1]
[7, 1, 5, 3, 9, 1]
[1, 5, 9, 3, 7, 1, 1]
[1, 1, 5, 9, 3, 7, 1, 4]
[1, 1, 1, 5, 9, 3, 7, 2, 4]
[1, 7, 3, 1, 5, 9, 1, 2, 4, 8]
[9, 3, 1, 7, 1, 5, 1, 2, 8, 6, 4]

Now for the big question --> Now why are the results different?

The first obvious answer would be that when using GNU or BSD implementations the result would be different? Right? Well the implementations are different, but yielding (check the linked implementations for details) same results. The core of the issue is elsewhere.

The real issue here the algorithm itself. When quick-sort is used then what you get is unstable sorting (when you are comparing the two equal values their order does not remain the same). When you have your [1,2,3,4,5,6,7,8,9] which you then transform in the block to [1,0,1,0,1,0,1,0,1] with max(_by) you are sorting the array into [1,1,1,1,1,0,0,0,0]. You start with 1, but which one? Well then you get unpredictable result. (max(_by) is the reason why are getting the odd numbers first and then the even ones).

See GNU qsort comment:

Warning: If two objects compare as equal, their order after sorting is unpredictable. That is to say, the sorting is not stable. This can make a difference when the comparison considers only part of the elements. Two elements with the same sort key may differ in other respects.

Now to sort it like the engine does:

[1,2,3,4,5,6,7,8,9] -> The first that are taken in account are the odd numbers [1,3,5,7,9] and those are considered equal with max_by{|t| t%2} produces [1,1,1,1,1].

Conclusion:

Now which one to take? Well it is unpredictable in your case it was the ones you got. I'll get different ones even for the same ruby version as the underlying quick-sort algorithm is unstable by its nature.

like image 197
tukan Avatar answered Nov 05 '22 13:11

tukan