In Java I need to match <a> tags in a string that do not have an href attribute. For example, in the following string:
text <a class="aClass" href="#">link1</a> text <a class="aClass" target="_blank">link2</a> text
it should not match <a class="aClass" href="#">link1</a> (because it contains href) but it should match <a class="aClass" target="_blank">link2</a> (because it does not contain href).
I managed to build the RegEx to match my tags:
<a[^>]*>(.*?)</a>
but I cannot figure out how to eliminate tags with href.
(I know I can use HTML parsers etc., but I need to do this with RegEx.)
Be careful with regexes like <a[^>]* as these will also match other valid HTML tags which start with an a, such as <abbr> or <address>. Also, simply looking for the existence of the string href isn't good enough, as that string could be inside the value of another attribute, such as <a class="thishrefstuff"..., or part of another attribute name, like <a hreflang="en"...
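To make these two pitfalls concrete, here is a minimal sketch. The class and method names (NaivePatternDemo, looseTagFound, mentionsHref) are made up for illustration and are not part of the question or the answer; the sample tags are the ones named above.

```java
import java.util.regex.Pattern;

class NaivePatternDemo {
    // The loose open-tag pattern discussed above (hypothetical constant name).
    static final Pattern LOOSE_OPEN_TAG = Pattern.compile("<a[^>]*>");

    // True if the loose pattern finds a match anywhere in the input.
    static boolean looseTagFound(String html) {
        return LOOSE_OPEN_TAG.matcher(html).find();
    }

    // True if the raw text "href" occurs anywhere in the tag.
    static boolean mentionsHref(String tag) {
        return tag.contains("href");
    }

    public static void main(String[] args) {
        // Neither of these is an anchor tag, yet the loose pattern matches both.
        System.out.println(looseTagFound("<abbr>RADIO</abbr>"));           // true
        System.out.println(looseTagFound("<address>somewhere</address>")); // true
        // Neither of these has an href attribute, yet a plain substring check says so.
        System.out.println(mentionsHref("<a class=\"thishrefstuff\">"));   // true
        System.out.println(mentionsHref("<a hreflang=\"en\">"));           // true
    }
}
```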
This expression will match anchor tags <a ...</a> which don't contain an href attribute. It requires the tag name to be a, and not a tag which simply starts with the letter a like <address>. It ignores href embedded in the name of an attribute, like the valid hreflang='en' or the made up Attributehref="some value". It also ignores href inside an attribute value, such as bogus='href=""'.
<a(?=\s|>)(?!(?:[^>=]|=(['"])(?:(?!\1).)*\1)*?\shref=['"])[^>]*>.*?<\/a>
<a(?=\s|>)
match the open tag and ensure the next character after the tag name is either whitespace or the close bracket; this forces the name to be a and not something else
(?!
start the negative lookahead: if we find an href in this tag then this tag isn't the one we're looking for
(?:
start a non-capture group to move through all characters inside the tag
[^>=]
match all non tag closing characters, which prevents the regex engine from leaving the tag, and non equal signs, which prevents the engine from blindly matching through attribute values
|
or
=(['"])
match an equal sign followed by an open double or single quote. The quote is captured into group 1 so it can be correctly paired later
(?:(?!\1).)*
match all characters which are not the close quote that matches the open quote
\1
match the correct close quote
)*?
close the non-capture group and repeat it as often as necessary until
\shref=['"]
matching the desired href attribute. The \s and =["'] ensure the attribute name is simply href
)
close the negative lookahead
[^>]*>.*?<\/a>
match the entire string from open to close
Input text
<abbr>RADIO</abbr> text <a class="aClass" href="#">link1</a> text <a bogus='href=""' class="aClass" target="_blank">link2</a> text
Code
If you're looking to use this in a replace function to remove non-href-anchor tags then just replace all matches with nothing.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Module1 {
    public static void main(String[] args) {
        // Placeholder: replace with the text you want to search
        String sourcestring = "source string to match with pattern";
        // CASE_INSENSITIVE also accepts <A ...>; DOTALL lets . match line breaks inside the link text
        Pattern re = Pattern.compile(
            "<a(?=\\s|>)(?!(?:[^>=]|=(['\"])(?:(?!\\1).)*\\1)*?\\shref=['\"])[^>]*>.*?<\\/a>",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
        Matcher m = re.matcher(sourcestring);
        int mIdx = 0;
        while (m.find()) {
            // Group 0 is the whole match; group 1 is the quote captured inside the lookahead
            for (int groupIdx = 0; groupIdx <= m.groupCount(); groupIdx++) {
                System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
            }
            mIdx++;
        }
    }
}
Matches
$matches Array:
(
    [0] => Array
        (
            [0] => <a bogus='href=""' class="aClass" target="_blank">link2</a>
        )
    [1] => Array
        (
            [0] =>
        )
)
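The suggestion to replace all matches with nothing could look roughly like the sketch below. The class name RemoveAnchorsWithoutHref and the method strip are hypothetical; the pattern and the sample input are the ones from this answer, and Matcher.replaceAll does the removal. MULTILINE is left out here since the pattern uses no ^ or $ anchors.

```java
import java.util.regex.Pattern;

class RemoveAnchorsWithoutHref {
    // Same expression as in the answer, compiled once and reused.
    static final Pattern NO_HREF_ANCHOR = Pattern.compile(
        "<a(?=\\s|>)(?!(?:[^>=]|=(['\"])(?:(?!\\1).)*\\1)*?\\shref=['\"])[^>]*>.*?<\\/a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Removes every anchor element that has no href attribute.
    static String strip(String html) {
        return NO_HREF_ANCHOR.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "<abbr>RADIO</abbr> text <a class=\"aClass\" href=\"#\">link1</a> text "
            + "<a bogus='href=\"\"' class=\"aClass\" target=\"_blank\">link2</a> text";
        // Only the link2 anchor is removed; <abbr> and the link1 anchor are kept.
        System.out.println(strip(input));
    }
}
```

Note that the removal takes the link text with it. To keep the text and drop only the tags, the pattern would need a capture group around the content and a replacement that refers to it.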
I find it odd that you would need to do it with regex, but you can use a negative lookahead.
<a(?![^>]+href).*?>(.*?)</a>
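As a rough check of this shorter pattern, the sketch below runs it against the string from the question and against an hreflang tag. The class name SimpleLookaheadDemo and the helper findAll are invented for the example. Because the lookahead only searches for the text href before the next >, it also rejects tags whose attributes merely contain that text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleLookaheadDemo {
    // The shorter negative-lookahead pattern from this answer.
    static final Pattern SIMPLE = Pattern.compile("<a(?![^>]+href).*?>(.*?)</a>");

    // Collects every full match found in the input.
    static List<String> findAll(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = SIMPLE.matcher(html);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String question = "text <a class=\"aClass\" href=\"#\">link1</a> text "
            + "<a class=\"aClass\" target=\"_blank\">link2</a> text";
        // Prints [<a class="aClass" target="_blank">link2</a>]
        System.out.println(findAll(question));
        // Prints [] even though this tag has no href attribute, only hreflang.
        System.out.println(findAll("<a hreflang=\"en\">x</a>"));
    }
}
```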