I would like to patch some text data extracted from web pages. Sample:
t="First sentence. Second sentence.Third sentence."
There is no space after the period at the end of the second sentence. This tells me that the third sentence was on a separate line (after a br tag) in the original document.
I want to use a regexp to insert a "\n" character in the proper places and patch my text. My regex:
t2=t.gsub(/([.\!?])([A-Z1-9])/,$1+"\n"+$2)
But unfortunately it doesn't work: "NoMethodError: undefined method `+' for nil:NilClass". How can I properly backreference the matched groups? It was so easy in Microsoft Word: I just had to use the \1 and \2 symbols.
Capturing groups are a way to treat multiple characters as a single unit. They are created by placing the characters to be grouped inside a set of parentheses. For example, the regular expression (dog) creates a single group containing the letters "d" "o" and "g" .
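As a small illustration of capturing groups in Ruby (the sample string and variable names are made up for this sketch), each parenthesized part of the pattern can be read back by index from the MatchData object:

```ruby
# Hypothetical sample input; the pattern has two capturing groups.
m = /(\d+)-(\d+)/.match("pages 10-20")

whole_match  = m[0] # the entire matched text
first_group  = m[1] # text captured by the first pair of parentheses
second_group = m[2] # text captured by the second pair of parentheses
```

Index 0 is always the whole match; the groups are numbered from 1 in the order their opening parentheses appear.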
tl;dr: non-capturing groups, as the name suggests, are parts of the regex that still take part in the match but are not stored as a separate numbered group, and ?: is the way to mark a group as non-capturing. Let's say you have an email address [email protected] . The following regex will create two groups, the id part and @example.com part.
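The address in the text above is obscured, so the sketch below uses a made-up one (john@example.com) and made-up patterns purely to show the difference between a capturing and a non-capturing group in Ruby:

```ruby
address = "john@example.com" # hypothetical address, not the one from the text

# Two capturing groups: the id part and the domain part.
both = /(\w+)(@example\.com)/.match(address)
both_captures = both.captures

# Same pattern, but the domain part is marked non-capturing with ?:
only_id = /(\w+)(?:@example\.com)/.match(address)
only_id_captures = only_id.captures
only_id_whole = only_id[0] # the domain is still part of the overall match
```

Here both_captures would hold two strings and only_id_captures just one, while only_id_whole still covers the full address.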
=~ is Ruby's pattern-matching operator. It matches a regular expression against a string; the regex is usually written on the left and the string on the right, though either order works. If a match is found, the index of the first match in the string is returned. If no match is found, nil is returned.
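A quick sketch of the two possible outcomes of =~ (the string and variable names are only for illustration):

```ruby
str = "white chocolate"

found_at  = /choc/ =~ str # index where the first match starts
not_found = /dark/ =~ str # nil, because "dark" does not occur in str
```

Because nil is falsy and any index (including 0) is truthy in Ruby, the result can be used directly in an if condition.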
The “sub” in “gsub” stands for “substitute”, and the “g” stands for “global”. Here is an example string: str = "white chocolate" Let's say that we want to replace the word “white” with the word “dark”. Here's how: str.gsub("white", "dark")
You can backreference in the substitution string with \1 (to match capture group 1).
t = "First sentence. Second sentence.Third sentence!Fourth sentence?Fifth sentence."
t.gsub(/([.!?])([A-Z1-9])/, "\\1\n\\2") # => "First sentence. Second sentence.\nThird sentence!\nFourth sentence?\nFifth sentence."
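The quoting matters here, so as a rough sketch (shortened input, illustrative variable names): in a double-quoted Ruby string the backslash has to be doubled, because "\1" on its own is read by Ruby as an octal character escape before gsub ever sees it.

```ruby
t = "Second sentence.Third sentence."
re = /([.!?])([A-Z1-9])/

# Single quotes keep the backslash, so gsub sees \1 and \2.
single = t.gsub(re, '\1' + "\n" + '\2')

# Double quotes need the backslash escaped to get the same replacement.
double = t.gsub(re, "\\1\n\\2")

# Unescaped in double quotes: Ruby turns \1 and \2 into control characters.
broken = t.gsub(re, "\1\n\2")
```

The first two calls give the same patched text; the third silently replaces the matched characters with control characters instead of the captured groups.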
If you are using gsub(regex, replacement), then use '\1', '\2', ... to refer to the match. Make sure not to put double quotes around the replacement, or else escape the backslash as in Joshua's answer. The conversion from '\1' to the match will be done within gsub, not by literal interpretation.
If you are using gsub(regex){replacement}, then use $1, $2, ...
But for your case, it is easier not to use matches:
t2 = t.gsub(/(?<=[.\!?])(?=[A-Z1-9])/, "\n")
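To make the two remaining options concrete, here is a minimal sketch using the sample string from the question (variable names are illustrative). The original attempt fails because $1+"\n"+$2 is evaluated as an ordinary argument before gsub runs, at which point $1 is still nil; inside a block, $1 and $2 are set for each match.

```ruby
t = "First sentence. Second sentence.Third sentence."

# Block form: the block runs once per match, so $1 and $2 are populated.
with_block = t.gsub(/([.!?])([A-Z1-9])/) { "#{$1}\n#{$2}" }

# Lookaround form: a zero-width match between the punctuation and the
# next character, so nothing has to be captured or re-inserted.
with_lookaround = t.gsub(/(?<=[.!?])(?=[A-Z1-9])/, "\n")
```

Both calls should give the same patched text; the lookaround version simply inserts the newline at the matched position without consuming any characters.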
If you got here because of Rubocop complaining "Avoid the use of Perl-style backrefs." about $1, $2, etc., you can do this instead:
some_id = $1
# or
some_id = Regexp.last_match[1] if Regexp.last_match
some_id = $5
# or
some_id = Regexp.last_match[5] if Regexp.last_match
It'll also want you to do
%r{//}.match(some_string)
instead of
some_string[//]
Lame (Rubocop)
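For reference, a small sketch of reading a group without $1 (the input string and pattern are made up for illustration). Regexp.last_match also accepts the group number as an argument, which avoids the separate nil check when it is called right after a successful match:

```ruby
some_string = "order-42" # hypothetical input

if some_string =~ /order-(\d+)/
  # Equivalent to $1 at this point.
  some_id = Regexp.last_match(1)
end

# Alternative that does not rely on the last-match state at all.
m = /order-(\d+)/.match(some_string)
same_id = m && m[1]
```

The second form keeps the MatchData in a local variable, which tends to be easier to follow when several matches happen close together.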