
Java Regex to get the text from HTML anchor (<a>...</a>) tags

Tags:

java

regex

I'm trying to get the text inside a certain tag. So if I have:

<a href="http://something.com">Found</a>

I want to be able to retrieve the Found text.

I'm trying to do it using regex. I can do it if the <a href="http://something.com"> part stays the same, but it doesn't.

So far I have this:

Pattern titleFinder = Pattern.compile( ".*[a-zA-Z0-9 ]* ([a-zA-Z0-9 ]*)</a>.*" );

I think the last two parts, ([a-zA-Z0-9 ]*)</a>.*, are OK, but I don't know what to do for the first part.

BeginnerPro asked Jan 07 '11 18:01




1 Answer

As others have said, don't use regex to parse HTML. If you are aware of the shortcomings, though, you might get away with it. Try:

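import java.util.regex.Matcher;
import java.util.regex.Pattern;

// subjectString holds the HTML text to search
Pattern titleFinder = Pattern.compile("<a[^>]*>(.*?)</a>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
Matcher regexMatcher = titleFinder.matcher(subjectString);
while (regexMatcher.find()) {
    // regexMatcher.group(1) is the text between the opening <a ...> tag and </a>
}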

This will iterate over all matches in the string.

It won't handle nested <a> tags, and it ignores all the attributes inside the tag.
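
For illustration, here is a minimal, self-contained sketch that wraps the same pattern in a class and collects every captured group into a list. The class name AnchorTextExtractor, the method name extractAnchorTexts, and the sample input are assumptions made up for this example; only the pattern itself comes from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class AnchorTextExtractor {
    // Same pattern as in the answer: skip the attributes, then lazily
    // capture everything up to the next closing </a>.
    private static final Pattern TITLE_FINDER =
            Pattern.compile("<a[^>]*>(.*?)</a>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the text between each <a ...> and the next </a>, in order of appearance.
    static List<String> extractAnchorTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher matcher = TITLE_FINDER.matcher(html);
        while (matcher.find()) {
            texts.add(matcher.group(1));
        }
        return texts;
    }

    public static void main(String[] args) {
        // Made-up sample input: the href differs between the links,
        // and the second tag is upper case.
        String html = "<a href=\"http://something.com\">Found</a> and "
                + "<A HREF='http://other.example'>Also found</A>";
        System.out.println(extractAnchorTexts(html)); // prints [Found, Also found]
    }
}
```

Besides the limitations already mentioned, note that <a[^>]* also matches any other tag whose name starts with "a" (for example <abbr>), so this is only suitable for input you know reasonably well.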

Tim Pietzcker answered Sep 28 '22 02:09