
Replace the string between opening and closing anchor tags with some other string

Tags: regex, replace

I need to replace the string between a pair of anchor tags with some other string. To be more clear:

<a blah blah>Click Here</a>

I want to replace 'Click Here' with an <img src=... /> tag. I read a couple of other resources and tried hard with Lars Olav Torvik's regex tool, but failed badly!

Please help me out!

Rutwick Gangurde asked Dec 13 '22 06:12


1 Answer

Don't use regex to parse HTML!

Yes, in general, using regex to parse HTML is fraught with peril. Computer scientists will correctly point out that HTML is not a REGULAR language. However, contrary to what many here believe, there are cases where a regex solution is perfectly valid and appropriate. Read Jeff Atwood's blog post on this very subject: Parsing Html The Cthulhu Way. That disclaimer aside, let's forge ahead with a regex solution...

Re-Statement of the problem:

The original question is pretty vague. Here is a more precise (possibly not at all what the OP is asking) interpretation/reformulation of the question:

Given: We have some HTML text (either HTML 4.01 or XHTML 1.0). This text contains <A..>...</A> anchor elements. Some of these anchor elements are links to an image file resource (i.e. the HREF attribute points to a URI ending with a file extension of: JPEG, JPG, PNG or GIF). Some of these links to images are simple text links, where the content of the anchor element is plain text with no other HTML elements, e.g. <a href="picture.jpg">Link text with no HTML tags</a>.

Find: Is there a regex solution that will take these "plain-text-link-to-image-resource-file" links, and replace the link text with an IMG element having a SRC attribute set to the same image URI resource? The following (valid HTML 4.01) example input has three paragraphs. All the links in the first paragraph are to be modified but all the links in the second and third paragraphs are NOT to be modified and left as-is:

Example HTML input:

<p title="Image links with plain text contents to be modified">
    This is a <a href="img1.png">LINK 1</a> simple anchor link to image.
    This <a title="<>" href="img2.jpg">LINK 2</a> has attributes before HREF.
    This <a href="img3.gif" title='<>'>LINK 3</a> has attributes after HREF.
</p>
<p title="NON-image links with plain text contents NOT to be modified">
    This is a <a href="tmp1.txt">LINK 1</a> simple anchor link to NON-image.
    This <a title="<>" href="tmp2.txt">LINK 2</a> has attributes before HREF.
    This <a href="tmp3.txt" title='<>'>LINK 3</a> has attributes after HREF.
</p>
<p title="Image links with NON-plain text contents NOT to be modified">
    This is a <a href="img1.png"><b>BOLD 1</b></a> anchor link to image.
    This is an <a href="img3.gif"><img src="img3.gif"/></a> image link to image.
</p>

Desired HTML output:

<p title="Image links with plain text contents to be modified">
    This is a <a href="img1.png"><img src="img1.png" /></a> simple anchor link to image.
    This <a title="<>" href="img2.jpg"><img src="img2.jpg" /></a> has attributes before HREF.
    This <a href="img3.gif" title='<>'><img src="img3.gif" /></a> has attributes after HREF.
</p>
<p title="NON-image links with plain text contents NOT to be modified">
    This is a <a href="tmp1.txt">LINK 1</a> simple anchor link to NON-image.
    This <a title="<>" href="tmp2.txt">LINK 2</a> has attributes before HREF.
    This <a href="tmp3.txt" title='<>'>LINK 3</a> has attributes after HREF.
</p>
<p title="Image links with NON-plain text contents NOT to be modified">
    This is a <a href="img1.png"><b>BOLD 1</b></a> anchor link to image.
    This is an <a href="img3.gif"><img src="img3.gif"/></a> image link to image.
</p>

Note that the test-case <A..>...</A> anchor tags in these examples have both single- and double-quoted attribute values, both before and after the desired HREF attribute, and that these values contain Cthulhu-tempting (yet perfectly valid HTML 4.01) angle brackets.

Note also that the replacement text is an (empty) IMG tag ending in: '/>' (which is NOT valid HTML 4.01).
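
To see why the angle brackets inside attribute values matter, here is a minimal sketch comparing two start-tag patterns on one of the test cases. Both patterns ($naive and $quoteAware) are hypothetical illustrations and are not part of the solution that follows. The naive one assumes the start tag ends at the first ">", while the quote-aware one steps over quoted attribute values as a unit.

```php
<?php
// Illustrative only: $naive and $quoteAware are assumed names, not part of the answer.
$html = '<a title="<>" href="img2.jpg">LINK 2</a>';

// Naive: treats the first ">" as the end of the start tag.
$naive = '%<a\b[^>]*>%i';

// Quote-aware: consumes each quoted value whole, so a ">" inside
// quotes cannot end the tag early.
$quoteAware = '%<a\b(?:"[^"]*"|\'[^\']*\'|[^>"\'])*>%i';

preg_match($naive, $html, $m1);
preg_match($quoteAware, $html, $m2);

echo $m1[0], "\n"; // prints: <a title="<>
echo $m2[0], "\n"; // prints: <a title="<>" href="img2.jpg">
```

The naive pattern stops inside the TITLE value, so any replacement built on it would corrupt the markup. This is the reason the regex below spells out the quoted and unquoted attribute value alternatives explicitly.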

A regex solution:

The statement of the problem defines a highly specific pattern to be matched which has the following requirements:

  • The <A..> start tag may have any number of attributes before and/or after the HREF attribute.
  • The HREF attribute value must end with JPEG, JPG, PNG or GIF (case-insensitive).
  • The contents of the <A..>...</A> element may NOT contain any other HTML tags.
  • The <A..>...</A> element target pattern is NOT a nested structure.

When dealing with such highly specific sub-strings, a well-crafted regex solution can work very well (with very few edge cases that can trip it up). Here is a tested PHP function that will do a pretty good job (and correctly transform the above example input):

// Convert text-only contents of image links to IMG element.
function textLinksToIMG($text) {
    $re = '% # Match A element with image URL and text-only contents.
        (                     # Begin $1: A element start tag.
          <a                  # Start of A element start tag.
            (?:               # Zero or more attributes before HREF.
              \s+             # Whitespace required before attribute.
              (?!href\b)      # Match attributes other than HREF.
              [\w\-.:]+       # Attribute name (Non-HREF).
              (?:             # Attribute value is optional.
                \s*=\s*       # Attrib name and value separated by =.
                (?:           # Group for attrib value alternatives.
                  "[^"]*"     # Either double quoted,
                | \'[^\']*\'  # or single quoted,
                | [\w\-.:]+   # or unquoted value.
                )             # End group of value alternatives.
              )?              # Attribute value is optional.
            )*                # Zero or more attributes before HREF.
            \s+               # Whitespace required before attribute.
            href\s*=\s*       # HREF attribute name.
            (?|               # Branch reset group for $2: HREF value.
              "([^"]*)"       # Either $2.1: double quoted,
            | \'([^\']*)\'    # or $2.2: single quoted,
            | ([\w\-.:]+)     # or $2.3: unquoted value.
            )                 # End group of HREF value alternatives.
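            (?<=              # Look behind to assert HREF value ends in...
              jpeg[\'"]       # either JPEG,
            | jpg[\'"]        # or JPG,
            | png[\'"]        # or PNG,
            | gif[\'"]        # or GIF, each before a closing quote,
            | jpeg | jpg      # or the same extensions ending
            | png | gif       # an unquoted value.
            )                 # End look behind assertion.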
            (?:               # Zero or more attributes after HREF.
              \s+             # Whitespace required before attribute.
              [\w\-.:]+       # Attribute name.
              (?:             # Attribute value is optional.
                \s*=\s*       # Attrib name and value separated by =.
                (?:           # Group for attrib value alternatives.
                  "[^"]*"     # Either double quoted,
                | \'[^\']*\'  # or single quoted,
                | [\w\-.:]+   # or unquoted value.
                )             # End group of value alternatives.
              )?              # Attribute value is optional.
            )*                # Zero or more attributes after HREF.
          \s*                 # Allow whitespace before closing >
          >                   # End of A element start tag.
        )                     # End $1: A element start tag.
        ([^<>]*)              # $3: A element contents (text-only).
        (</a\s*>)             # $4: A element end tag.
        %ix';
    return preg_replace($re, '$1<img src="$2" />$4', $text);
}

Yes, the regex in this solution is long, but this is mostly due to the extensive commenting, which also makes it highly readable. It also correctly handles quoted attribute values that may contain angle brackets. Yes, it is certainly possible to create some HTML markup that will break this solution, but the markup required to do so would be so convoluted that it is virtually unheard of in practice.
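
Two constructs in the pattern may be unfamiliar: the branch reset group (?|...), which lets every alternative capture into the same group number, and the lookbehind (?<=...), which checks how the HREF value ended without consuming anything. The following is a reduced sketch of just those two pieces, handling quoted values only. The helper name hrefImageValue() is hypothetical and exists purely for illustration.

```php
<?php
// Illustrative only: hrefImageValue() is an assumed helper, not part of the solution.
// Returns the HREF value if it is quoted and ends in an image extension, else null.
function hrefImageValue(string $attr): ?string {
    $re = '%href\s*=\s*
        (?|                 # Branch reset: both alternatives capture into group 1.
          "([^"]*)"         # Double quoted value,
        | \'([^\']*)\'      # or single quoted value.
        )
        (?<=                # Look behind, just past the closing quote.
          jpeg[\'"] | jpg[\'"] | png[\'"] | gif[\'"]
        )
        %ix';
    return preg_match($re, $attr, $m) ? $m[1] : null;
}

var_dump(hrefImageValue('href="img1.png"')); // string(8) "img1.png"
var_dump(hrefImageValue("href='IMG2.JPG'")); // string(8) "IMG2.JPG"
var_dump(hrefImageValue('href="tmp1.txt"')); // NULL
```

Without the branch reset, the single-quoted value would land in group 2 rather than group 1, and the replacement string could not refer to the HREF value with a single $2 as the full function does.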

ridgerunner answered Jan 31 '23 08:01