I need to scrape some content from an HTTP response with Java. The required fields in the response are: foo, bar and bla. My current pattern is very slow. Any ideas on how to improve it?
Response:
...
<div class="ui-a">
<div class="ui-b">
<p><strong>foo</strong></p>
<p>bar</p>
</div>
<div class="ui-c">
<p><strong>bla</strong></p>
<p>...</p>
</div>
</div>
<div class="ui-a">
<div class="ui-b">
<p><strong>foo1</strong></p>
<p>bar1</p>
</div>
<div class="ui-c">
<p><strong>bla1</strong></p>
<p>...</p>
</div>
Pattern:
.*?<div class="ui-a">.*?<strong>(.*?)</strong>.*?<p>(.*?)</p>.*?</div>.*?<div class="ui-c">.*?<strong>(.*?)</strong>.*?
Since you can't make use of an HTML parser, try something like this:
import java.util.regex.*;

public class Main {
    public static void main (String[] args) {
        String html =
            "...\n" +
            "<div class=\"ui-a\">\n" +
            "<div class=\"ui-b\">\n" +
            " <p><strong>foo</strong></p>\n" +
            " <p>bar</p>\n" +
            "</div>\n" +
            "<div class=\"ui-c\">\n" +
            " <p><strong>bla</strong></p>\n" +
            " <p>...</p>\n" +
            "</div>\n" +
            "</div>\n" +
            "\n" +
            "<div class=\"ui-a\">\n" +
            "<div class=\"ui-b\">\n" +
            " <p><strong>foo1</strong></p>\n" +
            " <p>bar1</p>\n" +
            "</div>\n" +
            "<div class=\"ui-c\">\n" +
            " <p><strong>bla1</strong></p>\n" +
            " <p>...</p>\n" +
            "</div>";

        Pattern p = Pattern.compile(
            "(?sx) # enable DOT-ALL and COMMENTS \n" +
            "<div\\s+class=\"ui-a\"> # match '<div...ui-a...>' \n" +
            "(?:(?!<strong>).)*+ # match everything up to <strong> \n" +
            "<strong>([^<>]++)</strong> # match <strong>...</strong> \n" +
            "(?:(?!<p>).)*+ # match up to <p> \n" +
            "<p>([^<>]++)</p> # match <p>...</p> \n" +
            "(?:(?!<div\\s+class=\"ui-c\">).)*+ # match up to '<div...ui-c...>' \n" +
            "<div\\s+class=\"ui-c\"> # match '<div...ui-c...>' \n" +
            "(?:(?!<strong>).)*+ # match everything up to <strong> \n" +
            "<strong>([^<>]++)</strong> # match <strong>...</strong> \n"
        );

        Matcher m = p.matcher(html);

        // print the three captured groups of every match
        while(m.find()) {
            System.out.println("---------------");
            for(int i = 1; i <= m.groupCount(); i++) {
                System.out.printf("group(%d) = %s\n", i, m.group(i));
            }
        }
    }
}
which will print the following to the console:
---------------
group(1) = foo
group(2) = bar
group(3) = bla
---------------
group(1) = foo1
group(2) = bar1
group(3) = bla1
Note my changes:
I used possessive quantifiers, *+ and ++: http://www.regular-expressions.info/possessive.html
Instead of .*?, I used (?:(?!...).)*+. The first, .*?, keeps track of every position it has matched so that it can back-track at a later stage. The latter, (?:(?!...).)*+, does not keep track of these positions.
That should make it quicker (not sure by how much...).
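To see what "will not keep track" means in practice, here is a minimal sketch that is not part of the code above. The class name PossessiveDemo and the helper fullMatch are made up for illustration; only java.util.regex is used. It shows that a possessive quantifier never gives characters back, and that the (?:(?!...).)*+ construct is still safe because the lookahead stops it right before the delimiter.

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    // Greedy: a* first takes every 'a', then gives one back so the trailing 'a' can match.
    static final Pattern GREEDY = Pattern.compile("a*a");

    // Possessive: a*+ takes every 'a' and never gives one back, so the trailing 'a' cannot match.
    static final Pattern POSSESSIVE = Pattern.compile("a*+a");

    // Same construct as in the answer: consume any character that does not start "<p>".
    // The lookahead stops the loop just before "<p>", so nothing needs to be given back.
    static final Pattern TEMPERED = Pattern.compile("(?s)(?:(?!<p>).)*+<p>");

    // Hypothetical helper: true if the whole input matches the pattern.
    static boolean fullMatch(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(fullMatch(GREEDY, "aaa"));      // true
        System.out.println(fullMatch(POSSESSIVE, "aaa"));  // false
        System.out.println(fullMatch(TEMPERED, "ab<p>"));  // true
    }
}
```

The second result is the trade-off to keep in mind: a possessive quantifier only helps when the sub-pattern cannot run past the point where the rest of the regex has to match.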
It seems that what you are looking for only sits between <strong> tags, so you can work with:
<strong>([a-zA-Z0-9]+)</strong>
Further, depending on what comes inside the strong tag, you can tighten the pattern: e.g. if you are sure that the text is always lowercase, you can remove A-Z from the character class, or if it always contains exactly 4 characters, you can use {4} instead of the + after the character class.
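A rough sketch of how that simpler pattern could be applied, assuming the response body is already available as a String. The class name StrongTagExtractor and the method extract are hypothetical names for illustration. Note that this only picks up the values inside strong tags (foo and bla in the sample); bar sits in a plain <p> and would need a separate pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StrongTagExtractor {
    // Compile once and reuse; the character class cannot run past the closing tag.
    private static final Pattern STRONG =
        Pattern.compile("<strong>([a-zA-Z0-9]+)</strong>");

    // Returns the text of every <strong>...</strong> element, in document order.
    static List<String> extract(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = STRONG.matcher(html);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String html =
            "<div class=\"ui-b\"><p><strong>foo</strong></p><p>bar</p></div>" +
            "<div class=\"ui-c\"><p><strong>bla</strong></p><p>...</p></div>";
        System.out.println(extract(html)); // [foo, bla]
    }
}
```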