If I do a match with a regular expression with ten captures:
/(o)(t)(th)(f)(fi)(s)(se)(e)(n)(t)/.match("otthffisseent")
then, for $10, I get:
$10 # => "t"
but it is missing from global_variables. I get (in an irb session):
[:$;, :$-F, :$@, :$!, :$SAFE, :$~, :$&, :$`, :$', :$+, :$=, :$KCODE, :$-K, :$,,
:$/, :$-0, :$\, :$_, :$stdin, :$stdout, :$stderr, :$>, :$<, :$., :$FILENAME,
:$-i, :$*, :$?, :$$, :$:, :$-I, :$LOAD_PATH, :$", :$LOADED_FEATURES,
:$VERBOSE, :$-v, :$-w, :$-W, :$DEBUG, :$-d, :$0, :$PROGRAM_NAME, :$-p, :$-l,
:$-a, :$binding, :$1, :$2, :$3, :$4, :$5, :$6, :$7, :$8, :$9]
Here, only the first nine are listed:
:$1, :$2, :$3, :$4, :$5, :$6, :$7, :$8, :$9
This is also confirmed by:
global_variables.include?(:$10) # => false
Where is $10 stored, and why isn’t it stored in global_variables?
Ruby seems to handle $1, $2, etc. at the parser level:
ruby --dump parsetree_with_comment -e '$100'
Output:
###########################################################
## Do NOT use this node dump for any purpose other than ##
## debug and research. Compatibility is not guaranteed. ##
###########################################################
# @ NODE_SCOPE (line: 1)
# | # new scope
# | # format: [nd_tbl]: local table, [nd_args]: arguments, [nd_body]: body
# +- nd_tbl (local table): (empty)
# +- nd_args (arguments):
# | (null node)
# +- nd_body (body):
# @ NODE_NTH_REF (line: 1)
# | # nth special variable reference
# | # format: $[nd_nth]
# | # example: $1, $2, ..
# +- nd_nth (variable): $100
BTW, the maximum number of capture groups is 32,767, and you can access all of them via $n:
/#{'()' * 32768}/ #=> RegexpError: too many capture groups are specified
/#{'()' * 32767}/ =~ '' #=> 0
defined? $32767 #=> "global-variable"
$32767 #=> ""
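The parser-level distinction can also be observed from inside Ruby. This is only a sketch: it assumes CRuby 2.6 or later, where RubyVM::AbstractSyntaxTree is available, and body_node_type is a hypothetical helper name, not part of Ruby. Like the node dump, this API is intended for debugging and research.

```ruby
# Hypothetical helper: parse a snippet and report the node type of its body.
# RubyVM::AbstractSyntaxTree is CRuby-specific and not a stable interface.
def body_node_type(source)
  root = RubyVM::AbstractSyntaxTree.parse(source) # top-level SCOPE node
  root.children.last.type                         # children: [tbl, args, body]
end

p body_node_type('$100')    # numbered reference, handled by the parser
p body_node_type('$stdout') # ordinary global variable lookup
```

On the versions I am aware of, the first call reports :NTH_REF and the second :GVAR, which matches the NODE_NTH_REF entry in the dump.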
The numbered variables returned from Kernel#global_variables will always be the same, even before they are assigned. That is, $1 through $9 will be returned even before you do the match, and matching more groups won't add to the list. (They also cannot be assigned to, e.g. using $10 = "foo".)
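The point about assignment can be checked with a small sketch. The helper name parses_and_runs? and the global $gr_demo are made up for illustration; the code has to go through eval because Ruby rejects the assignment when the source is parsed, before anything runs.

```ruby
# Hypothetical helper: returns true if the snippet compiles and runs,
# false if Ruby rejects it with a SyntaxError.
def parses_and_runs?(source)
  eval(source)
  true
rescue SyntaxError
  false
end

puts parses_and_runs?('$10 = "foo"')      # numbered reference: rejected
puts parses_and_runs?('$gr_demo = "foo"') # ordinary global: accepted
```

The rejection happens at parse time, which is consistent with the numbered variables not being ordinary entries in the global table.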
Consider the source code for the method:
VALUE
rb_f_global_variables(void)
{
    VALUE ary = rb_ary_new();
    char buf[2];
    int i;

    st_foreach_safe(rb_global_tbl, gvar_i, ary);
    buf[0] = '$';
    for (i = 1; i <= 9; ++i) {
        buf[1] = (char)(i + '0');
        rb_ary_push(ary, ID2SYM(rb_intern2(buf, 2)));
    }
    return ary;
}
You can (after getting used to looking at C) see from the for loop that the symbols $1 through $9 are hard-coded into the return value of the method.
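If C is unfamiliar, the loop can be paraphrased in Ruby. This is only an illustration of what the C code appends, not how Ruby implements it; hard_coded_numbered_globals is a made-up name.

```ruby
# Rough Ruby paraphrase of the C for loop: build the symbols :$1 through :$9
# regardless of whether any match has happened.
def hard_coded_numbered_globals
  (1..9).map { |i| :"$#{i}" }
end

p hard_coded_numbered_globals
```

Because the upper bound is the literal 9, nothing in this loop could ever produce :$10.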
So how, then, can you still use $10 if the output of global_variables doesn't change? Well, the output might be a bit misleading, because it suggests that your match data is stored in separate variables, but these are just shortcuts that delegate to the MatchData object stored in $~.
Essentially, $n looks at $~[n]. You'll find that $~, which holds this MatchData object (and does come from the global table), is part of the original output from the method, but it is not assigned until you do a match.
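A minimal sketch of that delegation, using the regular expression from the question. The constant TEN_GROUPS and the helper tenth_capture_both_ways are names invented for this example.

```ruby
# The ten-group pattern from the question.
TEN_GROUPS = /(o)(t)(th)(f)(fi)(s)(se)(e)(n)(t)/

# Hypothetical helper: run the match and fetch the tenth capture twice,
# once through the $10 shortcut and once through the MatchData in $~.
def tenth_capture_both_ways(string)
  return nil unless TEN_GROUPS.match(string)
  [$10, $~[10]]
end

via_shortcut, via_match_data = tenth_capture_both_ways("otthffisseent")
puts via_shortcut                   # "t"
puts via_match_data                 # "t"
puts via_shortcut == via_match_data # true
```

Both reads go through the same MatchData object, so they always agree. Note that $~ and the numbered shortcuts are local to the current method and thread, which is why the helper reads them right after the match instead of relying on the caller's scope.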
As to the justification for including $1 through $9 in the output of the function, you would need to ask someone on the Ruby core team. It might seem arbitrary, but there is likely some deliberation that went into the decision.
We consider this behavior a bug. We fixed this in the trunk.