
RegEx to match <a> html tags without specific attribute

Tags: java, regex

In Java I need to match <a> tags in a string that do not have an href attribute. For example, in the following string:

text <a class="aClass" href="#">link1</a> text <a class="aClass" target="_blank">link2</a> text

it should not match <a class="aClass" href="#">link1</a> (because it contains href) but it should match <a class="aClass" target="_blank">link2</a> (because it does not contain href).

I managed to build the RegEx to match my tags:

<a[^>]*>(.*?)</a>

but I cannot figure out how to eliminate tags with href.

(I know I can use HTML parsers etc., but I need to do this with RegEx.)

asked Jun 19 '13 20:06 by user2287359


2 Answers

Description

Be careful with regexes like <a[^>]* as these will also match other valid HTML tags that start with an a, such as <abbr> or <address>. Also, simply looking for the existence of the string href isn't good enough, as that string could be inside the value of another attribute, such as <a class="thishrefstuff"..., or part of another attribute's name, like <a hreflang="en"...
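Both pitfalls are easy to reproduce. The following is a minimal sketch, not part of the final solution; the class name NaivePatternDemo, its helper methods, and the sample strings are made up for illustration.

```java
import java.util.regex.Pattern;

class NaivePatternDemo {
    // Naive opening-tag pattern: accepts any tag whose name merely starts with "a".
    static final Pattern NAIVE_TAG = Pattern.compile("<a[^>]*>");

    // Returns true if the naive pattern finds a match anywhere in the input.
    static boolean naiveTagMatches(String html) {
        return NAIVE_TAG.matcher(html).find();
    }

    // Naive href check: looks for the substring "href" anywhere in the tag.
    static boolean naiveHasHref(String tag) {
        return tag.contains("href");
    }

    public static void main(String[] args) {
        // true, although <abbr> is not an anchor
        System.out.println(naiveTagMatches("<abbr>RADIO</abbr>"));
        // true, although the tag has no href attribute
        System.out.println(naiveHasHref("<a class=\"thishrefstuff\">"));
        // true, although hreflang is a different attribute
        System.out.println(naiveHasHref("<a hreflang=\"en\">"));
    }
}
```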

This expression will:

  • match all anchor elements <a ...>...</a> that don't contain an href attribute
  • enforce that the tag name is a, not a tag that simply starts with the letter a, like <address>
  • ignore attributes whose names merely contain the substring href, like the valid hreflang='en' or the made-up Attributehref="some value"
  • ignore all characters inside the values of properly formatted attributes, like bogus='href=""'

<a(?=\s|>)(?!(?:[^>=]|=(['"])(?:(?!\1).)*\1)*?\shref=['"])[^>]*>.*?<\/a>

Expanded

  • <a(?=\s|>) matches the start of the open tag and ensures the character after the tag name is whitespace or the closing bracket; this forces the tag name to be exactly a
  • (?! starts the negative lookahead: if an href attribute is found in this tag, the tag is not one we are looking for
    • (?: starts a non-capturing group that moves through the characters inside the tag
    • [^>=] matches any character other than > (which keeps the regex engine from leaving the tag) or = (which keeps the engine from blindly running through attribute values)
    • | or
    • =(['"]) matches an equals sign followed by an opening single or double quote; the quote is captured into group 1 so it can be paired correctly later
    • (?:(?!\1).)* matches all characters that are not the closing quote matching the opening quote
    • \1 matches the correct closing quote
    • )*? closes the non-capturing group and repeats it as often as necessary until
    • \shref=['"] matches the href attribute itself; the \s and =['"] ensure the attribute name is exactly href
    • ) closes the negative lookahead
  • [^>]*>.*?<\/a> matches the entire element, from the open tag to the close tag
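The breakdown above can also be kept next to the pattern itself. The sketch below writes the same expression in Pattern.COMMENTS mode, where whitespace in the pattern is ignored and # starts a comment that runs to the end of the line. The class name CommentedPatternDemo and the helper findAnchorsWithoutHref are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentedPatternDemo {
    // Same expression as above, spread over several lines with inline comments.
    static final Pattern NO_HREF_ANCHOR = Pattern.compile(
        "<a(?=\\s|>)          # tag name must be exactly a\n" +
        "(?!                  # reject the tag if an href attribute follows\n" +
        "  (?: [^>=]          #   any character except the bracket and equals sign\n" +
        "    | =(['\"])       #   or an equals sign plus opening quote (group 1)\n" +
        "      (?:(?!\\1).)*  #   the attribute value\n" +
        "      \\1            #   the matching closing quote\n" +
        "  )*?\n" +
        "  \\shref=['\"]      #   an attribute named exactly href\n" +
        ")\n" +
        "[^>]*>.*?<\\/a>      # rest of the open tag, the content, the close tag\n",
        Pattern.COMMENTS | Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Collects every anchor element that has no href attribute.
    static List<String> findAnchorsWithoutHref(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = NO_HREF_ANCHOR.matcher(html);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "text <a class=\"aClass\" href=\"#\">link1</a>"
                    + " text <a class=\"aClass\" target=\"_blank\">link2</a> text";
        for (String anchor : findAnchorsWithoutHref(html)) {
            System.out.println(anchor);
        }
    }
}
```

Run against the string from the question, this should report only the link2 anchor.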

Java Code Example:

Input text

<abbr>RADIO</abbr> text <a class="aClass" href="#">link1</a> text <a bogus='href=""' class="aClass" target="_blank">link2</a> text

Code

If you're looking to use this in a replace function to remove non-href-anchor tags then just replace all matches with nothing.

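import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Module1 {
  public static void main(String[] args) {
    // The input text shown above
    String sourcestring = "<abbr>RADIO</abbr> text <a class=\"aClass\" href=\"#\">link1</a> text "
        + "<a bogus='href=\"\"' class=\"aClass\" target=\"_blank\">link2</a> text";
    // The pattern must be one string literal; a line break inside the literal will not compile
    Pattern re = Pattern.compile(
        "<a(?=\\s|>)(?!(?:[^>=]|=(['\"])(?:(?!\\1).)*\\1)*?\\shref=['\"])[^>]*>.*?<\\/a>",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
    Matcher m = re.matcher(sourcestring);
    int mIdx = 0;
    // Print every group of every match; group 0 is the whole match
    while (m.find()) {
      for (int groupIdx = 0; groupIdx <= m.groupCount(); groupIdx++) {
        System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}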

Matches

$matches Array:
(
    [0] => Array
        (
            [0] => <a bogus='href=""' class="aClass" target="_blank">link2</a>
        )

    [1] => Array
        (
            [0] => 
        )

)
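For the replace use case mentioned before the code, a minimal sketch could look like the following. It assumes Matcher.replaceAll with an empty replacement is acceptable for your input; the class name RemoveAnchorsWithoutHref and the method strip are hypothetical.

```java
import java.util.regex.Pattern;

class RemoveAnchorsWithoutHref {
    static final Pattern NO_HREF_ANCHOR = Pattern.compile(
        "<a(?=\\s|>)(?!(?:[^>=]|=(['\"])(?:(?!\\1).)*\\1)*?\\shref=['\"])[^>]*>.*?<\\/a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Deletes every <a>...</a> element that lacks an href attribute,
    // including its link text.
    static String strip(String html) {
        return NO_HREF_ANCHOR.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "<abbr>RADIO</abbr> text <a class=\"aClass\" href=\"#\">link1</a>"
                     + " text <a bogus='href=\"\"' class=\"aClass\" target=\"_blank\">link2</a> text";
        System.out.println(strip(input));
    }
}
```

Note that this removes the link text along with the tags; the whitespace on either side of the removed element is left in place.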
answered Oct 15 '22 15:10 by Ro Yo Mi


I find it odd that you would need to do it with regex, but you can use a negative lookahead.

<a(?![^>]+href).*?>(.*?)</a>
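As a rough sketch of how this shorter pattern might be used from Java, the example below runs it against the string from the question and reads the link text from group 1. The class name SimpleLookaheadDemo and the method firstLinkText are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleLookaheadDemo {
    static final Pattern SIMPLE = Pattern.compile("<a(?![^>]+href).*?>(.*?)</a>");

    // Returns the link text (group 1) of the first match, or null if nothing matches.
    static String firstLinkText(String html) {
        Matcher m = SIMPLE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "text <a class=\"aClass\" href=\"#\">link1</a>"
                    + " text <a class=\"aClass\" target=\"_blank\">link2</a> text";
        System.out.println(firstLinkText(html)); // link2
    }
}
```

This pattern does not check that the tag name is exactly a, and it treats the substring href anywhere in the tag as an href attribute, so tags such as <abbr> or attributes such as bogus='href=""' are not handled the way the longer expression handles them.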
answered Oct 15 '22 16:10 by Explosion Pills