
RegEx: Don't match a certain character if it's inside quotes

Disclosure: I have read this answer many times here on SO and I know better than to use regex to parse HTML. This question is just to broaden my knowledge of regex.

Say I have this string:

some text <tag link="fo>o"> other text

I want to match the whole tag but if I use <[^>]+> it only matches <tag link="fo>.
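
The truncated match is easy to reproduce. Here is a minimal sketch in Python (the question does not name a language, so Python's re module is an assumption):

```python
import re

text = 'some text <tag link="fo>o"> other text'

# [^>]+ stops at the first '>', even though that '>' sits inside the quotes.
naive = re.compile(r'<[^>]+>')
match = naive.search(text)

print(match.group(0))  # <tag link="fo>
```

The character class has no notion of "inside quotes", so the first > always ends the match.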

How can I make sure that a > inside quotes is ignored?

I can trivially write a parser with a while loop to do this, but I want to know how to do it with regex.
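
For comparison, the while-loop approach mentioned above might look roughly like this. It is only a sketch in Python; find_tag_end is a hypothetical helper name, and it assumes quoted values contain no escaped quotes:

```python
def find_tag_end(text, start):
    """Return the index just past the '>' that closes the tag starting at
    text[start], or -1 if no unquoted '>' is found."""
    quote = None          # the quote character we are currently inside, if any
    i = start + 1
    while i < len(text):
        ch = text[i]
        if quote is not None:
            if ch == quote:       # matching quote closes the quoted value
                quote = None
        elif ch in ('"', "'"):    # entering a quoted value
            quote = ch
        elif ch == '>':           # unquoted '>' ends the tag
            return i + 1
        i += 1
    return -1


text = 'some text <tag link="fo>o"> other text'
start = text.index('<')
end = find_tag_end(text, start)
print(text[start:end])  # <tag link="fo>o">
```

The regex answers have to encode the same "inside quotes or not" state in the pattern itself.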

steve asked Mar 04 '14 06:03


2 Answers

Regular Expression:

<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>

Online demo:

http://regex101.com/r/yX5xS8

Full Explanation:

I know this regex might be a headache to look at, so here is my explanation:

<                      # Open HTML tags
    [^>]*?             # Lazy Negated character class for closing HTML tag
    (?:                # Open Outside Non-Capture group
        (?:            # Open Inside Non-Capture group
            ('|")      # Capture group for quotes, backreference group 1
            [^'"]*?    # Lazy Negated character class for quotes
            \1         # Backreference 1
        )              # Close Inside Non-Capture group
        [^>]*?         # Lazy Negated character class for closing HTML tag
    )*                 # Close Outside Non-Capture group
>                      # Close HTML tags
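
The commented layout above maps directly onto a verbose-mode pattern. The following is a sketch using Python's re.VERBOSE flag; the choice of Python and the name TAG_VASILI are assumptions made for illustration, and the comments are paraphrased:

```python
import re

# Same pattern as above, written with re.VERBOSE so that whitespace and
# '#' comments inside the pattern are ignored.
TAG_VASILI = re.compile(r"""
    <                  # opening angle bracket
    [^>]*?             # as few non-'>' characters as possible
    (?:
        (?:
            ('|")      # group 1: the opening quote character
            [^'"]*?    # quoted content, no quote characters of either kind
            \1         # the same quote character closes the value
        )
        [^>]*?         # as few non-'>' characters as possible
    )*
    >                  # closing angle bracket
""", re.VERBOSE)

text = 'some text <tag link="fo>o"> other text'
m = TAG_VASILI.search(text)
print(m.group(0))  # <tag link="fo>o">
print(m.group(1))  # "
```

Group 1 holds the quote character of the last quoted value that was matched.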

Vasili Syrakis answered Oct 20 '22 04:10


This is a slight improvement on Vasili Syrakis's answer. It handles "…" and '…' completely separately, and does not use the lazy *? quantifier.

Regular expression

<[^'">]*(("[^"]*"|'[^']*')[^'">]*)*>

Demo

http://regex101.com/r/jO1oQ1

Explanation

<                    # start of HTML tag
    [^'">]*          #   any characters other than ', " or >
    (                #   outer group
        (            #     inner group
            "[^"]*"  #       "..."
        |            #      or
            '[^']*'  #       '...'
        )            #
        [^'">]*      #   any characters other than ', " or >
    )*               #   zero or more of outer group
>                    # end of HTML tag
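
A quick way to try this pattern is sketched below in Python (an assumption; any engine with the same basic syntax should behave alike). TAG_ZRAJM and the sample string are made up for illustration:

```python
import re

# The pattern from this answer, unchanged.
TAG_ZRAJM = re.compile(r"""<[^'">]*(("[^"]*"|'[^']*')[^'">]*)*>""")

# One value contains '>', the other contains double quotes inside '...'.
sample = """<a href="x>y" title='say "hi"'>link</a>"""

# finditer yields match objects, so group(0) is always the whole tag.
tags = [m.group(0) for m in TAG_ZRAJM.finditer(sample)]
for tag in tags:
    print(tag)
# <a href="x>y" title='say "hi"'>
# </a>
```

Each quoted value is consumed as a unit, so neither the > nor the embedded double quotes end the tag early.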

This version is slightly better than Vasili's in that single quotes are allowed inside "…", double quotes are allowed inside '…', and an (incorrect) tag like <a href='> will not be matched.
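
The differences can be checked side by side. The following is a sketch in Python (an assumption, as are the names VASILI and ZRAJM and the test strings):

```python
import re

VASILI = re.compile(r"""<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>""")
ZRAJM = re.compile(r"""<[^'">]*(("[^"]*"|'[^']*')[^'">]*)*>""")

# A tag with an unterminated quote.
broken = "<a href='>"
print(VASILI.search(broken).group(0))  # <a href='>
print(ZRAJM.search(broken))            # None

# A single quote and a '>' inside a double-quoted value.
mixed = """<p title="it's > ok">"""
print(VASILI.search(mixed).group(0))   # <p title="it's >
print(ZRAJM.search(mixed).group(0))    # <p title="it's > ok">
```

In the first pattern, [^>]*? is allowed to consume quote characters, so when the quoted group cannot match it falls back to stopping at the first >.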

It is slightly worse than Vasili's solution in that the groups are captured. If you do not want that, replace ( with (?: in all places. (Just using ( makes the regex shorter and a little more readable.)
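
Whether the groups capture can matter in practice. In Python's re module, for instance (an assumed choice of language), findall returns the group contents instead of the whole match when the pattern has capture groups. A small sketch, with hypothetical variable names:

```python
import re

capturing = re.compile(r"""<[^'">]*(("[^"]*"|'[^']*')[^'">]*)*>""")
non_capturing = re.compile(r"""<[^'">]*(?:(?:"[^"]*"|'[^']*')[^'">]*)*>""")

text = 'some text <tag link="fo>o"> other text'

# With capture groups, findall returns one tuple of group contents per match.
print(capturing.findall(text))      # [('"fo>o"', '"fo>o"')]

# Without capture groups, findall returns the whole matches.
print(non_capturing.findall(text))  # ['<tag link="fo>o">']
```

Both patterns match the same text; only what is reported for the groups differs.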

zrajm answered Oct 20 '22 04:10