
Java: I have a big string of html and need to extract the href="..." text

I have a string containing a large chunk of HTML and am trying to extract the link from the href="..." portion of the string. The href could be in one of the following forms:

<a href="..." />
<a class="..." href="..." />

I don't normally have trouble with regex, but for some reason the following code does not do what I expect:

String innerHTML = getHTML();
Pattern p = Pattern.compile("href=\"(.*)\"", Pattern.DOTALL);
Matcher m = p.matcher(innerHTML);
if (m.find()) {
    // Get all groups for this match
    for (int i = 0; i <= m.groupCount(); i++) {
        String groupStr = m.group(i);
        System.out.println(groupStr);
    }
}

Can someone tell me what is wrong with my code? I have done this kind of thing in PHP, but in Java I am somehow doing something wrong... What happens is that it prints the whole HTML string whenever I try to print the match...

EDIT: Just so that everyone knows what kind of a string I am dealing with:

<a class="Wrap" href="item.php?id=43241"><input type="button">
    <span class="chevron"></span>
  </a>
  <div class="menu"></div>

Every time I run the code, it prints the whole string... That's the problem...

And about using jTidy... I'm on it but it would be interesting to know what went wrong in this case as well...

Legend asked Nov 03 '09 22:11


2 Answers

.*

This is a greedy operation that will match any character, including the quotes (and, with Pattern.DOTALL, line breaks too), so the match runs on to the last quote in the string.

Try something like:

"href=\"([^\"]*)\""
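
To see the difference on the string from the question, here is a minimal sketch that runs both patterns against it. The class name HrefFirstMatch and the helper firstHref are made up for this illustration, and the HTML is hard-coded instead of coming from getHTML():

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefFirstMatch {
    // Pattern from the question: (.*) is greedy and, with DOTALL, also crosses line breaks.
    static final Pattern GREEDY = Pattern.compile("href=\"(.*)\"", Pattern.DOTALL);
    // Negated character class: the group cannot contain a quote, so it ends at the first one.
    static final Pattern NEGATED = Pattern.compile("href=\"([^\"]*)\"");

    // Hypothetical helper: returns group 1 of the first match, or null if there is none.
    static String firstHref(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<a class=\"Wrap\" href=\"item.php?id=43241\"><input type=\"button\">\n"
                + "    <span class=\"chevron\"></span>\n"
                + "  </a>\n"
                + "  <div class=\"menu\"></div>";
        // Greedy: runs from the href value up to the last quote in the input.
        System.out.println(firstHref(GREEDY, html));
        // Negated class: prints item.php?id=43241
        System.out.println(firstHref(NEGATED, html));
    }
}
```

The greedy version swallows everything up to the quote that closes class="menu", which is why it looks like the whole string is being printed.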
Kugel answered Sep 21 '22 07:09


There are two problems with the code you've posted:

Firstly, the .* in your regular expression is greedy, so it matches as many characters as possible and only stops at the last " character that can be found. You can make the match non-greedy by changing it to .*?.

Secondly, to pick up all the matches, you need to keep iterating with Matcher.find rather than looping over groups. Groups give you access to each parenthesized section of the regex within a single match. You, however, are looking for every place where the whole regular expression matches.

Putting these together gives you the following code which should do what you need:

Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.DOTALL);
Matcher m = p.matcher(innerHTML);

while (m.find()) 
{
    System.out.println(m.group(1));
}
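
For anyone who wants something they can paste and run, here is the same loop as a self-contained sketch. The class name HrefExtractor, the method extractHrefs and the sample input are assumptions made for the example; they are not from the question:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractor {
    // Non-greedy group: stops at the first closing quote after href=".
    private static final Pattern HREF = Pattern.compile("href=\"(.*?)\"", Pattern.DOTALL);

    // Hypothetical helper: collects group 1 of every match, in document order.
    static List<String> extractHrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {          // each call advances to the next match
            links.add(m.group(1));  // group 1 is the text between the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        // Both anchor forms mentioned in the question.
        String html = "<a href=\"first.html\" />\n"
                + "<a class=\"Wrap\" href=\"item.php?id=43241\" />";
        System.out.println(extractHrefs(html)); // prints [first.html, item.php?id=43241]
    }
}
```

Note that this only handles double-quoted href values. Single-quoted or unquoted attributes would need a different pattern, or an HTML parser such as the jTidy mentioned in the question.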
Phil Ross answered Sep 21 '22 07:09