Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pattern optimization

I need to scrape some content from a HTTP response with Java. The required fields in the response are: foo, bar and bla. My current pattern is very slow. Any ideas how to improve that?

Response:

...
<div class="ui-a">
<div class="ui-b">
    <p><strong>foo</strong></p>
    <p>bar</p>
</div>
<div class="ui-c">
    <p><strong>bla</strong></p>
    <p>...</p>
</div>
</div>

<div class="ui-a">
<div class="ui-b">
    <p><strong>foo1</strong></p>
    <p>bar1</p>
</div>
<div class="ui-c">
    <p><strong>bla1</strong></p>
    <p>...</p>
</div>

Pattern:

.*?<div class="ui-a">.*?<strong>(.*?)</strong>.*?<p>(.*?)</p>.*?</div>.*?<div class="ui-c">.*?<strong>(.*?)</strong>.*?
like image 681
CannyDuck Avatar asked Mar 29 '26 23:03

CannyDuck


2 Answers

Since you can't make use of an HTML parser, try something like this:

import java.util.regex.*;

public class Main {
    public static void main (String[] args) {
        String html =
                "...\n" +
                "<div class=\"ui-a\">\n" +
                "<div class=\"ui-b\">\n" +
                "    <p><strong>foo</strong></p>\n" +
                "    <p>bar</p>\n" +
                "</div>\n" +
                "<div class=\"ui-c\">\n" +
                "    <p><strong>bla</strong></p>\n" +
                "    <p>...</p>\n" +
                "</div>\n" +
                "</div>\n" +
                "\n" +
                "<div class=\"ui-a\">\n" +
                "<div class=\"ui-b\">\n" +
                "    <p><strong>foo1</strong></p>\n" +
                "    <p>bar1</p>\n" +
                "</div>\n" +
                "<div class=\"ui-c\">\n" +
                "    <p><strong>bla1</strong></p>\n" +
                "    <p>...</p>\n" +
                "</div>";

        Pattern p = Pattern.compile(
                "(?sx)                               # enable DOT-ALL and COMMENTS     \n" +
                "<div\\s+class=\"ui-a\">             # match '<div...ui-a...>'         \n" +
                "(?:(?!<strong>).)*+                 # match everything up to <strong> \n" +
                "<strong>([^<>]++)</strong>          # match <strong>...</strong>      \n" +
                "(?:(?!<p>).)*+                      # match up to <p>                 \n" +
                "<p>([^<>]++)</p>                    # match <p>...</p>                \n" +
                "(?:(?!<div\\s+class=\"ui-c\">).)*+  # match up to '<div...ui-a...>'   \n" +
                "<div\\s+class=\"ui-c\">             # match '<div...ui-c...>'         \n" +
                "(?:(?!<strong>).)*+                 # match everything up to <strong> \n" +
                "<strong>([^<>]++)</strong>          # match <strong>...</strong>      \n"
        );

        Matcher m = p.matcher(html);

        while(m.find()) {
            System.out.println("---------------");
            for(int i = 1; i <= m.groupCount(); i++) {
                System.out.printf("group(%d) = %s\n", i, m.group(i));
            }
        }
    }
}

which will print the following to the console:

---------------
group(1) = foo
group(2) = bar
group(3) = bla
---------------
group(1) = foo1
group(2) = bar1
group(3) = bla1

Note my changes:

  • *+ and ++: http://www.regular-expressions.info/possessive.html
  • instead of .*?, I used (?:(?!...).)*+. The first, .*? will keep track of all possible matches it makes to be able to back-track at a later stage. The latter, (?:(?!...).)*+, will not keep track of these matches.

That should make it quicker (not sure by how much...).

like image 106
Bart Kiers Avatar answered Apr 01 '26 09:04

Bart Kiers


Seems, what you are looking for is between tag only, you can work with:

<strong>([a-zA-Z0-9]+)</strong>

further, depending on what comes inside strong tag, you can change the pattern e.g. if you are sure that the text is always small case you can remove A-Z from above pattern or if it contains only 4 characters you can use a {4} after the pattern.

like image 22
Ravi Bhatt Avatar answered Apr 01 '26 11:04

Ravi Bhatt



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!