
regex match keywords that are not in quotes

Tags: c#, regex, parsing

How can I look for keywords that are not inside a string?

For example if I have the text:

Hello this text is an example.

bla bla bla "this text is inside a string"

"random string" more text bla bla bla "foo"

I would like to match every occurrence of the word text that is not inside " ". In other words, I would like to match the text on the first line and the text after "random string" on the third line.

Note that I do not want to match the text on the second line, because it is inside a string.


Possible solution:

I have been working on it and this is what I have so far:

(?s)((?<q>")|text)(?(q).*?"|)

Note that .NET regex supports a conditional construct of the form: (?(predicate) true alternative|false alternative)

So the regex reads:

Find " or text. If you find ", keep selecting until you find " again (.*?"). If you find text, do nothing more.

When I run that regex, though, I match the whole string. I am asking this question for the purpose of learning. I know I can remove all the strings first and then look for what I need.
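The idea described above ("find " or text; on a quote, skip to the closing quote") can also be written without a conditional. This is only a sketch of a different technique from the one in the question: an alternation whose first branch swallows a whole quoted string and whose second branch captures the keyword in a named group. The class name SkipQuotedSketch, the method FindOutsideQuotes, and the group name kw are made up for this illustration, and the sketch assumes strings contain no escaped quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SkipQuotedSketch
{
    // Hypothetical helper: start index of every "text" outside double quotes.
    public static List<int> FindOutsideQuotes(string input)
    {
        // Branch 1 consumes a complete quoted string; branch 2 captures the keyword.
        var pattern = new Regex("\"[^\"]*\"|(?<kw>text)");
        var hits = new List<int>();
        foreach (Match m in pattern.Matches(input))
        {
            // Quoted strings produce matches too, but leave the kw group unset.
            if (m.Groups["kw"].Success)
                hits.Add(m.Index);
        }
        return hits;
    }

    public static void Main()
    {
        string sample = "Hello this text is an example.\n"
                      + "bla bla bla \"this text is inside a string\"\n"
                      + "\"random string\" more text bla bla bla \"foo\"";
        Console.WriteLine(string.Join(", ", FindOutsideQuotes(sample)));
    }
}
```

Because every quoted string is consumed as a match of its own, a text inside quotes is never reached by the second branch. The caller only has to ignore the matches where kw did not participate.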

Asked Jul 23 '12 at 20:07 by Tono Nam




1 Answer

Here is one answer:

(?<=^([^"]|"[^"]*")*)text

This means:

(?<=       # preceded by...
^          # the start of the string, then
([^"]      # either not a quote character
|"[^"]*"   # or a full string
)*         # as many times as you want
)
text       # then the text

You can easily extend this to handle strings containing escapes as well.

In C# code:

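// requires: using System.Text.RegularExpressions;
Match m = Regex.Match("bla bla bla \"this text is inside a string\"",
            "(?<=^([^\"]|\"[^\"]*\")*)text", RegexOptions.ExplicitCapture);
// m.Success is false here: the only "text" in this input is inside the quoted string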
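To see the lookbehind pattern on the full sample from the question, here is a minimal sketch that collects the index of every match. The pattern is the one given above; the class name LookbehindSketch and the helper MatchIndexes are invented for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookbehindSketch
{
    // Same pattern as in the answer: everything before "text" must be
    // non-quote characters or complete quoted strings.
    private const string Pattern = "(?<=^([^\"]|\"[^\"]*\")*)text";

    // Hypothetical helper: start index of each match in the input.
    public static int[] MatchIndexes(string input)
    {
        return Regex.Matches(input, Pattern, RegexOptions.ExplicitCapture)
                    .Cast<Match>()
                    .Select(m => m.Index)
                    .ToArray();
    }

    public static void Main()
    {
        string sample = "Hello this text is an example.\n"
                      + "bla bla bla \"this text is inside a string\"\n"
                      + "\"random string\" more text bla bla bla \"foo\"";
        foreach (int index in MatchIndexes(sample))
            Console.WriteLine("match at index " + index);
    }
}
```

Note that the pattern has no word boundaries, so it would also match the text inside a longer word such as context. If that matters, \btext\b in place of text is one possible adjustment.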

Added from comment discussion - extended version (match on a per-line basis and handle escapes). Use RegexOptions.Multiline for this:

(?<=^([^"\r\n]|"([^"\\\r\n]|\\.)*")*)text

In a C# string this looks like:

"(?<=^([^\"\r\n]|\"([^\"\\\\\r\n]|\\\\.)*\")*)text"
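A short sketch of what the extended version changes, using the C# string shown above together with RegexOptions.Multiline. The class and method names (EscapeAwareSketch, FindPerLine) and the two test inputs are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class EscapeAwareSketch
{
    // The extended pattern from the answer, written as a regular C# string.
    private const string Pattern =
        "(?<=^([^\"\r\n]|\"([^\"\\\\\r\n]|\\\\.)*\")*)text";

    // Hypothetical helper: start index of each match, evaluated line by line.
    public static int[] FindPerLine(string input)
    {
        var options = RegexOptions.Multiline | RegexOptions.ExplicitCapture;
        return Regex.Matches(input, Pattern, options)
                    .Cast<Match>()
                    .Select(m => m.Index)
                    .ToArray();
    }

    public static void Main()
    {
        // Actual characters: "say \"text\" now" text
        // The first text sits between escaped quotes, so it is still inside the string.
        string escaped = "\"say \\\"text\\\" now\" text";
        Console.WriteLine(string.Join(", ", FindPerLine(escaped)));

        // An unterminated quote on line 1 does not affect line 2,
        // because the lookbehind is anchored at the start of each line.
        string unterminated = "he said \"oops\ntext here";
        Console.WriteLine(string.Join(", ", FindPerLine(unterminated)));
    }
}
```

With the simpler pattern, the first input would report the text between the escaped quotes as a match, and the second input would report nothing, since the stray quote on line 1 leaves the rest of the input "inside a string".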

Since you now want to use ** instead of " as the delimiter, here is a version for that:

(?<=^([^*\r\n]|\*(?!\*)|\*\*([^*\\\r\n]|\\.|\*(?!\*))*\*\*)*)text

Explanation:

(?<=       # preceded by
^          # start of line
 (         # either
 [^*\r\n]| #  not a star or line break
 \*(?!\*)| #  or a single star (star not followed by another star)
  \*\*     #  or 2 stars, followed by...
   ([^*\\\r\n] # either: not a star or a backslash or a linebreak
   |\\.        # or an escaped char
   |\*(?!\*)   # or a single star
   )*          # as many times as you want
  \*\*     # ended with 2 stars
 )*        # as many times as you want
)
text      # then the text

Since this version doesn't contain " characters, it's cleaner to use a verbatim string literal:

@"(?<=^([^*\r\n]|\*(?!\*)|\*\*([^*\\\r\n]|\\.|\*(?!\*))*\*\*)*)text"
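For completeness, a small sketch that runs the ** version against a line containing both a lone star and a **...** section. The pattern is the verbatim string above; the class name StarDelimiterSketch, the helper FindOutsideStars, and the sample line are assumptions made for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class StarDelimiterSketch
{
    // The ** pattern from the answer, as a verbatim string literal.
    private const string Pattern =
        @"(?<=^([^*\r\n]|\*(?!\*)|\*\*([^*\\\r\n]|\\.|\*(?!\*))*\*\*)*)text";

    // Hypothetical helper: start index of each "text" outside **...** sections.
    public static int[] FindOutsideStars(string input)
    {
        var options = RegexOptions.Multiline | RegexOptions.ExplicitCapture;
        return Regex.Matches(input, Pattern, options)
                    .Cast<Match>()
                    .Select(m => m.Index)
                    .ToArray();
    }

    public static void Main()
    {
        // The lone star in "2 * 3" is not a delimiter; "**hidden text**" is.
        string line = "2 * 3 text **hidden text** more text";
        foreach (int index in FindOutsideStars(line))
            Console.WriteLine("match at index " + index);
    }
}
```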
Answered Sep 22 '22 at 14:09 by porges