
How to use regular expressions to parse HTML in Java?

Tags: java, regex

Can someone tell me a simple way to find href and src attributes in an HTML file using regular expressions in Java?
And then, how do I get the URL associated with each attribute?

Thanks for any suggestion.

Ricardo Felgueiras asked Mar 24 '09 11:03




2 Answers

Using regular expressions to pull values from HTML is always a mistake. HTML syntax is a lot more complex than it may first appear, and it's very easy for a page to catch out even a very complex regular expression.

Use an HTML parser instead. See also the question "What are the pros and cons of the leading Java HTML parsers?"
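
As a rough illustration of the parser approach, here is a minimal sketch that collects href and src values with the HTML parser that ships in the JDK (javax.swing.text.html.parser.ParserDelegator). This is a swapped-in example chosen only because it needs no extra dependency; it is not one of the parsers discussed in the linked question. The names LinkExtractor and extractUrls are hypothetical. The Swing parser was written for old HTML, so a dedicated third-party parser is probably a better fit for modern pages.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects href and src attribute values using the JDK's Swing HTML parser.
final class LinkExtractor {

    static List<String> extractUrls(String html) {
        final List<String> urls = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Called for tags that have content, such as <a>.
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                collect(attrs);
            }

            // Called for empty tags, such as <img>.
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                collect(attrs);
            }

            private void collect(MutableAttributeSet attrs) {
                HTML.Attribute[] wanted = {HTML.Attribute.HREF, HTML.Attribute.SRC};
                for (HTML.Attribute name : wanted) {
                    Object value = attrs.getAttribute(name);
                    if (value != null) {
                        urls.add(value.toString());
                    }
                }
            }
        };

        try {
            // The third argument tells the parser to ignore charset declarations in the markup.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href='link1'>bar</a> <img src=\"pic.png\"></body></html>";
        for (String url : extractUrls(html)) {
            System.out.println(url);
        }
    }
}
```

The callback only sees attributes the parser has already decoded, so quoting style, attribute order and extra attributes on the tag do not matter to the calling code.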

Dave Webb answered Sep 23 '22 11:09



The other answers are right. The Java regex API is not the proper tool for this job. Use the efficient, secure and well-tested high-level tools mentioned in the other answers.

If your question is about the regex API itself rather than a real-world problem (for learning purposes, say), you can do it with the following code:

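import java.util.regex.Matcher;
import java.util.regex.Pattern;

String html = "foo <a href='link1'>bar</a> baz <a href='link2'>qux</a> foo";
// Reluctant (.*?) stops at the first closing quote, so each match covers a single tag.
Pattern p = Pattern.compile("<a href='(.*?)'>");
Matcher m = p.matcher(html);
while (m.find()) {
    System.out.println(m.group(0)); // the whole matched tag
    System.out.println(m.group(1)); // the captured href value
}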

And the output is:

<a href='link1'>
link1
<a href='link2'>
link2

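Please note that the lazy (reluctant) quantifier *? must be used so that each match stops at the end of a single tag; a greedy .* would run from the first opening quote to the last closing quote in the input. Group 0 is the entire match, and group 1 is the first capturing group (the first pair of parentheses).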
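
The pattern above only matches an a tag whose sole attribute is a single-quoted href. For learning purposes, a slightly broader sketch could match both href and src with either quote style. The class name HrefSrcRegex and the method extract are hypothetical, and the pattern is an assumption about what counts as "good enough": it still ignores unquoted values and will happily match text inside comments, scripts, or attributes such as data-href.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for learning purposes only; a real HTML parser is more robust.
final class HrefSrcRegex {

    // Group 1: attribute name, group 2: opening quote, group 3: attribute value.
    // The back-reference \2 requires the closing quote to be the same as the opening one.
    private static final Pattern ATTR = Pattern.compile(
            "\\b(href|src)\\s*=\\s*([\"'])(.*?)\\2",
            Pattern.CASE_INSENSITIVE);

    static List<String> extract(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            // Record "name=value", normalising the attribute name to lower case.
            found.add(m.group(1).toLowerCase() + "=" + m.group(3));
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "foo <a href='link1'>bar</a> <img SRC=\"pic.png\"> <a href = \"link2\">qux</a>";
        for (String entry : extract(html)) {
            System.out.println(entry);
        }
    }
}
```

Here the reluctant (.*?) plays the same role as in the original pattern: it stops at the first quote that matches the opening one, so each match covers exactly one attribute value.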

Henryk Konsek answered Sep 23 '22 11:09
