Why are only a limited number of regular expression captures stored in `global_variables`?

If I do a match with a regular expression with ten captures:

/(o)(t)(th)(f)(fi)(s)(se)(e)(n)(t)/.match("otthffisseent")

then, for $10, I get:

$10 # => "t"

but it is missing from global_variables. I get (in an irb session):

[:$;, :$-F, :$@, :$!, :$SAFE, :$~, :$&, :$`, :$', :$+, :$=, :$KCODE, :$-K, :$,,
 :$/, :$-0, :$\, :$_, :$stdin, :$stdout, :$stderr, :$>, :$<, :$., :$FILENAME,
 :$-i, :$*, :$?, :$$, :$:, :$-I, :$LOAD_PATH, :$", :$LOADED_FEATURES,
 :$VERBOSE, :$-v, :$-w, :$-W, :$DEBUG, :$-d, :$0, :$PROGRAM_NAME, :$-p, :$-l,
 :$-a, :$binding, :$1, :$2, :$3, :$4, :$5, :$6, :$7, :$8, :$9]

Here, only the first nine are listed:

:$1, :$2, :$3, :$4, :$5, :$6, :$7, :$8, :$9

This is also confirmed by:

global_variables.include?(:$10) # => false

Where is $10 stored, and why isn’t it stored in global_variables?

Sagar Pandya asked Jan 09 '16 09:01


3 Answers

Ruby seems to handle $1, $2 etc. at the parser level:

ruby --dump parsetree_with_comment -e '$100'

Output:

###########################################################
## Do NOT use this node dump for any purpose other than  ##
## debug and research.  Compatibility is not guaranteed. ##
###########################################################

# @ NODE_SCOPE (line: 1)
# | # new scope
# | # format: [nd_tbl]: local table, [nd_args]: arguments, [nd_body]: body
# +- nd_tbl (local table): (empty)
# +- nd_args (arguments):
# |   (null node)
# +- nd_body (body):
#     @ NODE_NTH_REF (line: 1)
#     | # nth special variable reference
#     | # format: $[nd_nth]
#     | # example: $1, $2, ..
#     +- nd_nth (variable): $100

BTW, the maximum number of capture groups is 32,767 and you can access all via $n:

/#{'()' * 32768}/       #=> RegexpError: too many capture groups are specified

/#{'()' * 32767}/ =~ '' #=> 0
defined? $32767         #=> "global-variable"
$32767                  #=> ""

Stefan answered Sep 21 '22 06:09


The numbered variables returned by Kernel#global_variables are always the same, even before they are assigned. That is, $1 through $9 are returned even before you do the match, and matching more groups won't add to the list. (They also cannot be assigned directly, e.g. using $10 = "foo".)
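A minimal sketch for checking both claims on your own interpreter. The helper names `nth_ref_assignment_rejected?` and `numbered_globals` are made up for illustration, and the list printed by the second helper may differ between Ruby versions, so no expected output is given for it.

```ruby
# Hypothetical helper: reports whether Ruby rejects a direct assignment
# to a numbered capture variable. eval is used so that the rejection is
# raised at run time instead of aborting this whole script at parse time.
def nth_ref_assignment_rejected?(source = '$10 = "foo"')
  eval(source)
  false
rescue SyntaxError
  true
end

# Hypothetical helper: the numbered symbols that global_variables reports.
# Which symbols appear here depends on the Ruby version in use.
def numbered_globals
  global_variables.grep(/\A\$\d+\z/)
end

puts nth_ref_assignment_rejected?
p numbered_globals
```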

Consider the source code for the method:

VALUE
rb_f_global_variables(void)
{
    VALUE ary = rb_ary_new();
    char buf[2];
    int i;

    /* Collect every name registered in the global variable table. */
    st_foreach_safe(rb_global_tbl, gvar_i, ary);
    buf[0] = '$';

    /* Unconditionally append the symbols $1 through $9. */
    for (i = 1; i <= 9; ++i) {
        buf[1] = (char)(i + '0');
        rb_ary_push(ary, ID2SYM(rb_intern2(buf, 2)));
    }

    return ary;
}

You can (after getting used to looking at C) see from the for loop that the symbols $1 through $9 are hard-coded into the return value of the method.
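For readers less comfortable with C, the loop can be mirrored in Ruby. This is only an illustrative rendering, not how Ruby itself is implemented, and the method name `hard_coded_nth_ref_symbols` is made up:

```ruby
# Hypothetical Ruby rendering of the C loop above: build the names
# "$1" through "$9" and turn each one into a symbol, regardless of
# whether any match has taken place.
def hard_coded_nth_ref_symbols
  (1..9).map { |i| "$#{i}".to_sym }
end

p hard_coded_nth_ref_symbols
```

Nothing in this loop consults the last match, which is why the list never grows past nine entries.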

So how, then, can you still use $10 if the output of global_variables doesn't change? The output is a bit misleading: it suggests that your match data is stored in separate variables, but these are just shortcuts that delegate to the MatchData object stored in $~.

Essentially, $n reads $~[n]. You'll find that the variable holding this MatchData object, $~ (coming from the global table), is part of the original output from the method, but it is not assigned until you do a match.
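The delegation can be checked with the ten-group match from the question. A small sketch, where the helper name `tenth_capture_both_ways` is made up for illustration:

```ruby
# Hypothetical helper: runs the ten-group match from the question and
# returns the tenth capture read two ways, via $10 and via $~[10].
def tenth_capture_both_ways
  /(o)(t)(th)(f)(fi)(s)(se)(e)(n)(t)/.match("otthffisseent")
  [$10, $~[10]]
end

via_numbered, via_match_data = tenth_capture_both_ways
puts via_numbered == via_match_data
```

Both reads happen inside the same method as the match, since $~ and the numbered variables are local to the current method call.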

As to what the justification is for including $1 through $9 in the output of the function, you would need to ask someone on the Ruby core team. It might seem arbitrary, but there was likely some deliberation behind the decision.

Drenmi answered Sep 19 '22 06:09


We consider this behavior a bug. We fixed it in trunk.

Yukihiro Matsumoto answered Sep 22 '22 06:09