
Matching the selected option of a specific HTML tag with a Java regex

Tags:

java

regex

I have to extract a set of values from some HTML that isn't always well formed and that I have no control over (so Scanner does not seem to be an option).

This is a shopping cart, and the cart contains n rows, each with a quantity dropdown. I want to get the total number of products in the cart.

Given this HTML, I would want to match the values 2 and 5:

...
<select attr="other stuff" name="quantity">
    <option value="1" />
    <option value="2" selected="selected" />
</select>
....
<select name="quantity" attr="other stuff">
    <option selected="selected" value="5" />
    <option value="6" />
</select>

I've made a number of pitiful attempts, but given the number of variables (for example, the order of the 'value' and 'selected' attributes), most of my solutions either don't work or are really slow.

The last Java code I ended up with is the following:

Pattern pattern = Pattern.compile("select(.*?)name=\"quantity\"([.|\\n|\\r]*?)option(.*?)value=\"(/d)\" selected=\"selected\"", Pattern.DOTALL);
Matcher matcher = pattern.matcher(html);
if (matcher.find()) {
   ....
}

It's very slow and does not work when the attribute order changes. My regex knowledge is not good enough to write an efficient pattern.

Nick Cardoso asked Jan 07 '23 01:01


2 Answers

Instead of using a regular expression, you can use an XPath expression to retrieve all value attributes for the HTML you have in the question:

//select[@name="quantity"]/option[@selected="selected"]/@value

In words:

  • Find all <select> elements in the XML whose name attribute equals quantity, then their child <option> elements whose selected attribute equals selected.
  • Retrieve the value attributes of those <option> elements.

I would really consider trying XQuery/XPath; that's what it is made for. See the answers to the question "How to read XML using XPath in Java" for how to retrieve the values. An introduction to XPath expressions is also worth reading.


Consider a future situation where you need to find only the options that have selected="selected" and, e.g., status="accepted". The XPath expression would simply become:

//select[@name="quantity"]/option[@selected="selected" and @status="accepted"]/@value

The XPath expression is easy to extend, easy to review, and easy to verify.

Now what kind of regex monster would you have to create for the added condition? It would be hard to write and even harder to maintain. How can a code reviewer tell what the complex regular expression (cf. bobble bubble's answer) is doing? How do you prove that the regular expression actually does what it is supposed to do?
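As a rough illustration of the extended expression, here is a minimal sketch using the standard javax.xml.xpath API. The class name AcceptedQuantities, the <cart> wrapper element and the status="pending" value are made up for the example; only status="accepted" comes from the text above. It also assumes the input is well-formed XML, which the HTML in the question may not be.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original answer.
class AcceptedQuantities {
    // Extended expression from the text: selected options that also carry status="accepted".
    static final String XPATH =
        "//select[@name=\"quantity\"]/option[@selected=\"selected\" and @status=\"accepted\"]/@value";

    // Returns the matching value attributes in document order.
    // Assumes well-formed XML; anything the parser rejects surfaces as IllegalArgumentException.
    static List<String> acceptedValues(String xml) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(XPATH, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
    }

    public static void main(String[] args) {
        // Two selected options; only the first one has status="accepted".
        String xml = "<cart>"
            + "<select name=\"quantity\"><option value=\"2\" selected=\"selected\" status=\"accepted\" /></select>"
            + "<select name=\"quantity\"><option value=\"5\" selected=\"selected\" status=\"pending\" /></select>"
            + "</cart>";
        System.out.println(acceptedValues(xml)); // prints [2]
    }
}
```

Only the predicate inside the square brackets changed; the surrounding Java stays the same whatever condition is added.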

You can of course document the regular expression, something you should always do for regular expressions. But that doesn't prove anything.

My advice: Stay away from regular expressions unless there is absolutely no other way.


For fun, I made a snippet showing the basics of this approach:

import java.io.StringReader;
import javax.xml.xpath.*;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ReadElementsFromHtmlUsingXPath {
    // Sample markup. It has to be well-formed XML, because the XPath API
    // parses the InputSource with an XML parser.
    private static final String html =
        "<html>Read more about XPath <a href=\"www.w3schools.com/xsl/xpath_intro.asp\">here</a>..." +
        "<select attr=\"other stuff\" name=\"quantity\">" +
            "<option value=\"1\" />" +
            "<option value=\"2\" selected=\"selected\" />" +
        "</select>" +
        "<i><b>Oh and here's the second element</b></i>" +
        "<select name=\"quantity\" attr=\"other stuff\">" +
            "<option selected=\"selected\" value=\"5\" />" +
            "<option value=\"6\" />" +
        "</select>" +
        "And that's all folks</html>";

    // Value attribute of every selected option inside a quantity select.
    private static final String xpathExpr =
        "//select[@name=\"quantity\"]/option[@selected=\"selected\"]/@value";

    public static void main(String[] args) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            XPathExpression expr = xpath.compile(xpathExpr);
            // Evaluate against the markup and ask for a node set (the attribute nodes).
            NodeList nodeList = (NodeList) expr.evaluate(
                new InputSource(new StringReader(html)), XPathConstants.NODESET);
            // Print the value of each matched attribute node.
            for (int i = 0; i < nodeList.getLength(); ++i) {
                System.out.println(nodeList.item(i).getNodeValue());
            }
        } catch (XPathExpressionException e) {
            e.printStackTrace();
        }
    }
}

Resulting output:

2
5
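
Since the question asks for the total number of products, the XPath 1.0 sum() function could do the addition directly. The following is only a sketch under the same assumption as above (the input parses as well-formed XML); the class name SelectedQuantitySum and the method name sumSelected are invented for the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original answer.
class SelectedQuantitySum {
    // Same node selection as before, wrapped in XPath's sum() function.
    static final String SUM_XPATH =
        "sum(//select[@name=\"quantity\"]/option[@selected=\"selected\"]/@value)";

    // Returns the total of all selected quantities.
    // XPath numbers are doubles; a non-numeric value attribute makes the result NaN.
    static double sumSelected(String xml) {
        try {
            Double total = (Double) XPathFactory.newInstance().newXPath()
                .evaluate(SUM_XPATH, new InputSource(new StringReader(xml)), XPathConstants.NUMBER);
            return total;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html>"
            + "<select attr=\"other stuff\" name=\"quantity\">"
            + "<option value=\"1\" /><option value=\"2\" selected=\"selected\" />"
            + "</select>"
            + "<select name=\"quantity\" attr=\"other stuff\">"
            + "<option selected=\"selected\" value=\"5\" /><option value=\"6\" />"
            + "</select>"
            + "</html>";
        System.out.println(sumSelected(xml)); // prints 7.0
    }
}
```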
TT. answered Jan 24 '23 23:01


It surely depends on how malformed your HTML could be. A parser solution is to be preferred.

A regex that matches your requirement is not much of a challenge; it is just a matter of putting it together.

(?xi) # i-flag for caseless, x-flag for comments (free spacing mode) 

# 1.) match <select with optional space at the end
<\s*select\s[^>]*?\bname\s*=\s*["']\s*quantity[^>]*>\s*

# 2.) match lazily any amount of options until the "selected"
(?:<\s*option[^>]*>\s*)*?

# 3.) match selected using a lookahead and capture number from value
<\s*option\s(?=[^>]*?\bselected)[^>]*?\bvalue\s*=\s*["']\s*(\d[.,\d]*)

Try the pattern at regex101 or RegexPlanet (Java). As a Java String:

"(?i)<\\s*select\\s[^>]*?\\bname\\s*=\\s*[\"']\\s*quantity[^>]*>\\s*(?:<\\s*option[^>]*>\\s*)*?<\\s*option\\s(?=[^>]*?\\bselected)[^>]*?\\bvalue\\s*=\\s*[\"']\\s*(\\d[.,\\d]*)"

There is not much magic in it. It is a long, ugly pattern, mostly because it is parsing HTML.

  • \s is shorthand for whitespace, [ \t\n\x0B\f\r] in Java
  • \d is shorthand for a digit, [0-9]
  • \b matches a word boundary
  • (?: opens a non-capturing group
  • [^>] is the negation of > (it matches any character that is not >)
  • (?=[^>]*?\bselected) is a lookahead that checks for selected inside the tag, which makes the match independent of attribute order
  • (\d[.,\d]*) captures the number: one required digit followed by any number of digits, dots or commas

The matches are in group(1), the first capturing (parenthesized) group.
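
To show how group(1) might be consumed, here is a minimal sketch that loops over all matches with Matcher.find() and adds up the captured quantities. The class name RegexQuantitySum and the method sumSelected are made up for the example, and it assumes every captured quantity is a plain integer (a value such as "1,5" would make Integer.parseInt throw).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class RegexQuantitySum {
    // The pattern from above, split over three lines to mirror steps 1.) to 3.).
    private static final Pattern SELECTED_QUANTITY = Pattern.compile(
        "(?i)<\\s*select\\s[^>]*?\\bname\\s*=\\s*[\"']\\s*quantity[^>]*>\\s*"
        + "(?:<\\s*option[^>]*>\\s*)*?"
        + "<\\s*option\\s(?=[^>]*?\\bselected)[^>]*?\\bvalue\\s*=\\s*[\"']\\s*(\\d[.,\\d]*)");

    // Sums group(1) of every match. Assumes integer quantities.
    static int sumSelected(String html) {
        Matcher matcher = SELECTED_QUANTITY.matcher(html);
        int total = 0;
        while (matcher.find()) {
            total += Integer.parseInt(matcher.group(1));
        }
        return total;
    }

    public static void main(String[] args) {
        // The HTML fragment from the question.
        String html = "...\n"
            + "<select attr=\"other stuff\" name=\"quantity\">\n"
            + "    <option value=\"1\" />\n"
            + "    <option value=\"2\" selected=\"selected\" />\n"
            + "</select>\n"
            + "....\n"
            + "<select name=\"quantity\" attr=\"other stuff\">\n"
            + "    <option selected=\"selected\" value=\"5\" />\n"
            + "    <option value=\"6\" />\n"
            + "</select>\n";
        System.out.println(sumSelected(html)); // prints 7
    }
}
```

Compiling the pattern once into a static field avoids recompiling it for every page that is parsed.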

bobble bubble answered Jan 24 '23 23:01