Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does Ruby's Enumerable#zip create arrays internally?

In Ruby - Compare two Enumerators elegantly, it was said

The problem with zip is that it creates arrays internally, no matter what Enumerable you pass. There's another problem with length of input params

I had a look at the implementation of Enumerable#zip in YARV, and saw

static VALUE
enum_zip(int argc, VALUE *argv, VALUE obj)
{
    int i;
    ID conv;
    NODE *memo;
    VALUE result = Qnil;
    VALUE args = rb_ary_new4(argc, argv);
    int allary = TRUE;

    argv = RARRAY_PTR(args);
    for (i=0; i<argc; i++) {
        VALUE ary = rb_check_array_type(argv[i]);
        if (NIL_P(ary)) {
            allary = FALSE;
            break;
        }
        argv[i] = ary;
    }
    if (!allary) {
        CONST_ID(conv, "to_enum");
        for (i=0; i<argc; i++) {
            argv[i] = rb_funcall(argv[i], conv, 1, ID2SYM(id_each));
        }
    }
    if (!rb_block_given_p()) {
        result = rb_ary_new();
    }
    /* use NODE_DOT2 as memo(v, v, -) */
    memo = rb_node_newnode(NODE_DOT2, result, args, 0);
    rb_block_call(obj, id_each, 0, 0, allary ? zip_ary : zip_i, (VALUE)memo);

    return result;
}

Am I understanding the following bits correctly?

Check whether all of the arguments are arrays, and if so, replace some indirect reference to the array with a direct reference

    for (i=0; i<argc; i++) {
        VALUE ary = rb_check_array_type(argv[i]);
        if (NIL_P(ary)) {
            allary = FALSE;
            break;
        }
        argv[i] = ary;
    }

If they aren't all arrays, create an enumerator instead

    if (!allary) {
        CONST_ID(conv, "to_enum");
        for (i=0; i<argc; i++) {
            argv[i] = rb_funcall(argv[i], conv, 1, ID2SYM(id_each));
        }
    }

Create an array of arrays only if a block isn't given

    if (!rb_block_given_p()) {
        result = rb_ary_new();
    }

If everything is an array, use zip_ary, otherwise use zip_i, and call a block on each set of values

    /* use NODE_DOT2 as memo(v, v, -) */
    memo = rb_node_newnode(NODE_DOT2, result, args, 0);
    rb_block_call(obj, id_each, 0, 0, allary ? zip_ary : zip_i, (VALUE)memo);

Return an array of arrays if no block is given, else return nil (Qnil)?

    return result;
}
like image 431
Andrew Grimm Avatar asked Jun 27 '11 00:06

Andrew Grimm


1 Answers

I'll be using 1.9.2-p0 as that's what I have on hand.

The rb_check_array_type function looks like this:

VALUE
rb_check_array_type(VALUE ary)
{
    return rb_check_convert_type(ary, T_ARRAY, "Array", "to_ary");  
}

And rb_check_convert_type looks like this:

VALUE
rb_check_convert_type(VALUE val, int type, const char *tname, const char *method)
{
    VALUE v;

    /* always convert T_DATA */
    if (TYPE(val) == type && type != T_DATA) return val;
    v = convert_type(val, tname, method, FALSE);
    if (NIL_P(v)) return Qnil;
    if (TYPE(v) != type) {
        const char *cname = rb_obj_classname(val);
        rb_raise(rb_eTypeError, "can't convert %s to %s (%s#%s gives %s)",
                 cname, tname, cname, method, rb_obj_classname(v));
    }
    return v;
}

Note the convert_type call. This looks a lot like C version of Array.try_convert and try_convert just happens to look like this:

/*   
 *  call-seq:
 *     Array.try_convert(obj) -> array or nil
 *
 *  Try to convert <i>obj</i> into an array, using +to_ary+ method. 
 *  Returns converted array or +nil+ if <i>obj</i> cannot be converted
 *  for any reason. This method can be used to check if an argument is an
 *  array.
 *   
 *     Array.try_convert([1])   #=> [1]
 *     Array.try_convert("1")   #=> nil
 *
 *     if tmp = Array.try_convert(arg)
 *       # the argument is an array
 *     elsif tmp = String.try_convert(arg)
 *       # the argument is a string
 *     end
 *
 */
static VALUE
rb_ary_s_try_convert(VALUE dummy, VALUE ary)
{
    return rb_check_array_type(ary);
}

So, yes, the first loop is looking for anything in argv that is not an array and setting the allary flag if it finds such a thing.

In enum.c, we see this:

id_each = rb_intern("each");

So id_each is an internal reference for the Ruby each iterator method. And in vm_eval.c, we have this:

/*!  
 * Calls a method 
 * \param recv   receiver of the method
 * \param mid    an ID that represents the name of the method
 * \param n      the number of arguments
 * \param ...    arbitrary number of method arguments  
 *
 * \pre each of arguments after \a n must be a VALUE.
 */
VALUE
rb_funcall(VALUE recv, ID mid, int n, ...)

So this:

argv[i] = rb_funcall(argv[i], conv, 1, ID2SYM(id_each));

Is calling to_enum (with, essentially, the default argument) on whatever is in argv[i].

So, the end result of the first for and if blocks is that argv is either full of arrays or full of enumerators rather than possibly being a mix of the two. But note how the logic works: if something is found that isn't an array, then everything becomes an enumerator. The first part of the enum_zip function will wrap arrays in enumerators (which is essentially free or at least cheap enough not to worry about) but won't expand enumerators into arrays (which could be quite expensive). Earlier versions might have gone the other way (prefer arrays over enumerators), I'll leave that as an exercise for the reader or historians.

The next part:

if (!rb_block_given_p()) {
    result = rb_ary_new();
}

Creates a new empty array and leaves it in result if zip is being called without a block. And here we should note what zip returns:

enum.zip(arg, ...) → an_array_of_array
enum.zip(arg, ...) {|arr| block } → nil

If there is a block, then there is nothing to return and result can stay as Qnil; if there isn't a block, then we need an array in result so that an array can be returned.

From parse.c, we see that NODE_DOT2 is a double-dot range but it looks like they're just using the new node as a simple three element struct; rb_new_node just allocates an object, sets some bits, and assigns three values in a struct:

NODE*
rb_node_newnode(enum node_type type, VALUE a0, VALUE a1, VALUE a2)
{
    NODE *n = (NODE*)rb_newobj();

    n->flags |= T_NODE;
    nd_set_type(n, type);

    n->u1.value = a0;
    n->u2.value = a1;
    n->u3.value = a2;

    return n;
}

nd_set_type is just a bit fiddling macro. Now we have memo as just a three element struct. This use of NODE_DOT2 appears to be a convenient kludge.

The rb_block_call function appears to be the core internal iterator. And we see our friend id_each again so we'll be doing an each iteration. Then we see a choice between zip_i and zip_ary; this is where the inner arrays are created and pushed onto result. The only difference between zip_i and zip_ary appears to be the StopIteration exception handling in zip_i.

At this point we've done the zipping and we either have the array of arrays in result (if there was no block) or we have Qnil in result (if there was a block).


Executive Summary: The first loop explicitly avoids expanding enumerators into arrays. The zip_i and zip_ary calls will only work with non-temporary arrays if they have to build an array of arrays as a return value. So, if you call zip with at least one non-array enumerator and use the block form, then it is enumerators all the way down and the "problem with zip is that it creates arrays internally" does not happen. Reviewing 1.8 or other Ruby implementations is left as an exercise for the reader.

like image 83
mu is too short Avatar answered Oct 06 '22 00:10

mu is too short