
Ruby gsub / regex modifiers?

Tags: regex, ruby

Where can I find the documentation on the modifiers for gsub? \a \b \c \1 \2 \3 %a %b %c $1 $2 %3 etc.?

Specifically, I'm looking at this code: something.gsub(/%u/, unit). What's the %u?

Blaine asked Aug 05 '09 17:08





3 Answers

First off, %u is nothing special in a Ruby regex; it simply matches the literal characters % and u:

mixonic@pandora ~ $ irb
irb(main):001:0> '%u'.gsub(/%u/,'heyhey')
=> "heyhey"
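
In code like the snippet in the question, %u is most likely just a placeholder inside a template string that gets swapped for a value. A minimal sketch of that idea, assuming a hypothetical helper name and template text (neither comes from the original code):

```ruby
# Hypothetical helper: "%u" has no special meaning to the regex engine,
# so /%u/ matches the two literal characters "%" and "u".
def fill_unit(template, unit)
  template.gsub(/%u/, unit)
end

puts fill_unit("Speed: 12 %u (max 40 %u)", "km/h")
# prints: Speed: 12 km/h (max 40 km/h)
```

Every occurrence is replaced because gsub is global; sub would replace only the first one.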

The definitive documentation for Ruby 1.8 regex is in the Ruby Doc Bundle:

  • http://ruby-doc.org/docs/ruby-doc-bundle/Manual/man-1.4/syntax.html#regexp

Strings delimited by slashes are regular expressions. The characters right after the closing slash denote the options of the regular expression. Option i means that the regular expression is case insensitive. Option o means that the regular expression does expression substitution only once, the first time it is evaluated. Option x means extended regular expression, which means whitespace and comments are allowed in the expression. Option p denotes POSIX mode, in which newlines are treated as normal characters (matched by dots).

%r/STRING/ is another form of the regular expression literal.
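
As a rough illustration of the two literal forms and of the i and x options described above (the constant names and sample strings are made up for this sketch):

```ruby
# /.../i and %r{...}i build equivalent Regexp objects.
SLASH_FORM   = /ruby/i
PERCENT_FORM = %r{ruby}i

# The x option lets the pattern contain whitespace and comments.
DATE = /
  (\d{4})  # year
  -
  (\d{2})  # month
/x

puts SLASH_FORM == PERCENT_FORM       # true
puts "I like RUBY" =~ PERCENT_FORM    # 7 (index where the match starts)
puts "2009-08"[DATE, 2]               # 08
```

The %r form is mostly a convenience when the pattern itself contains slashes, since they then need no escaping.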

^
    beginning of a line or string 
$
    end of a line or string 
.
    any character except newline 
\w
    word character [0-9A-Za-z_]
\W
    non-word character
\s
    whitespace character [ \t\n\r\f]
\S
    non-whitespace character
\d
    digit, same as [0-9]
\D
    non-digit
\D
    non-digit 
\A
    beginning of a string 
\Z
    end of a string, or before newline at the end 
\z
    end of a string 
\b
    word boundary (outside [] only)
\B
    non-word boundary
\b
    backspace (0x08) (inside [] only)
[ ]
    any single character of set 
*
    0 or more of the previous regular expression
*?
    0 or more of the previous regular expression (non greedy)
+
    1 or more of the previous regular expression
+?
    1 or more of the previous regular expression (non greedy)
{m,n}
    at least m but at most n of the previous regular expression
{m,n}?
    at least m but at most n of the previous regular expression (non greedy)
?
    0 or 1 of the previous regular expression
|
    alternation 
( )
    grouping regular expressions 
(?# )
    comment 
(?: )
    grouping without backreferences 
(?= )
    zero-width positive look-ahead assertion 
(?! )
    zero-width negative look-ahead assertion 
(?ix-ix)
    turns on (or off) `i' and `x' options within regular expression. These modifiers are localized inside an enclosing group (if any).
(?ix-ix: )
    turns on (or off) `i' and `x' options within this non-capturing group.

Backslash notation and expression substitution are available in regular expressions.
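
A few entries from the list above, tried out on made-up sample strings (a sketch for illustration only; the constant and the inputs are not from the manual):

```ruby
TEXT = "first line\nsecond line"

# ^ matches at the start of every line, \A only at the start of the string.
p TEXT.scan(/^\w+/)     # ["first", "second"]
p TEXT.scan(/\A\w+/)    # ["first"]

# Greedy * takes as much as it can; *? stops at the first opportunity.
p "<a><b>"[/<.*>/]      # "<a><b>"
p "<a><b>"[/<.*?>/]     # "<a>"

# (?: ) groups without capturing, so (jpe?g) is capture group 1.
p "photo.jpeg"[/(?:photo|image)\.(jpe?g)/, 1]   # "jpeg"

# (?= ) look-ahead checks what follows without consuming it.
p "100 USD"[/\d+(?= USD)/]                      # "100"

# (?i: ) turns on case-insensitivity for that group only.
p "HELLO world" =~ /(?i:hello) world/           # 0
p "HELLO WORLD" =~ /(?i:hello) world/           # nil
```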

Good luck!

mixonic answered Nov 11 '22 16:11



Zenspider's Quickref contains a section explaining which escape sequences can be used in regexen and one listing the pseudo variables that get set by a regexp match. In the second argument to gsub you simply write the name of the variable with a backslash instead of a $ (for example \1 instead of $1) and it will be replaced with the value of that variable after applying the regexp. If you use a double-quoted string, you need to use two backslashes.

When using the block-form of gsub you can simply use the variables directly. If you return a string containing e.g. \1 from the block, that will not be replaced with $1. That only happens when using the two-argument form.
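
A small sketch of the difference between the two forms, using a made-up name and pattern (the constants are assumptions for illustration):

```ruby
NAME    = "John Smith"
PATTERN = /(\w+) (\w+)/

# Two-argument form, single quotes: one backslash is enough.
p NAME.gsub(PATTERN, '\2, \1')             # "Smith, John"

# Two-argument form, double quotes: the backslash must be doubled,
# because "\2" on its own is an octal character escape.
p NAME.gsub(PATTERN, "\\2, \\1")           # "Smith, John"

# Block form: use $1 and $2 directly.
p NAME.gsub(PATTERN) { "#{$2}, #{$1}" }    # "Smith, John"

# Block form: a returned \1 is NOT expanded; it stays literal text.
p NAME.gsub(PATTERN) { '\2, \1' }          # "\\2, \\1"
```

The last line prints the inspected form; the actual string is the six characters \2, \1.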

sepp2k answered Nov 11 '22 16:11



If you pass a block to sub/gsub, you can access the groups like this:

>> rx = /(ab(cd)ef)/
>> s = "-abcdef-abcdef"
>> s.gsub(rx) { $2 }
=> "-cd-cd"
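
Because the block is ordinary Ruby, the replacement can also be computed from the groups. A sketch building on the same regex (the method names are made up for illustration):

```ruby
RX = /(ab(cd)ef)/

# The block runs once per match; $1 and $2 refer to that match's groups.
def mark_groups(str)
  str.gsub(RX) { "[#{$1}|#{$2.upcase}]" }
end

# Regexp.last_match exposes the same groups through a MatchData object.
def inner_lengths(str)
  str.gsub(RX) { Regexp.last_match(2).length.to_s }
end

puts mark_groups("-abcdef-abcdef")     # -[abcdef|CD]-[abcdef|CD]
puts inner_lengths("-abcdef-abcdef")   # -2-2
```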
Stan answered Nov 11 '22 17:11
