
Using Regular Expressions

I am having problems using, in Java, a regular expression that works for me in JavaScript. On a web page, you may have:

<b>Renewal Date:</b> 03 May 2010</td>

I just want to be able to pull out the 03 May 2010, remembering that a webpage has more than just the above content. The way I currently perform this using JavaScript is:

DateStr = /<b>Renewal Date:<\/b>(.+?)<\/td>/.exec(returnedHTMLPage);

I tried to follow some tutorials on java.util.regex.Pattern and java.util.regex.Matcher with no luck. I can't seem to translate (.+?) into something they understand.

thanks,

Noeneel

bebeTech asked Apr 15 '26 01:04


2 Answers

This is how regular expressions are used in Java:

Pattern p = Pattern.compile("<b>Renewal Date:</b>(.+?)</td>");
Matcher m = p.matcher(returnedHTMLPage);

if (m.find()) // find the next match (and "generate the groups")
    System.out.println(m.group(1)); // prints whatever the .+? expression matched.

The Matcher class has other useful methods, such as m.matches(), which succeeds only if the entire input matches the pattern. See the java.util.regex.Matcher documentation for the full list.
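For reference, the snippet above can be wrapped into a small self-contained program. This is only a sketch: the class name RenewalDateDemo, the helper extractRenewalDate and the sample HTML string are made up for illustration; only the pattern itself comes from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RenewalDateDemo {
    // Compile once and reuse; Pattern instances are immutable and thread-safe.
    private static final Pattern RENEWAL =
            Pattern.compile("<b>Renewal Date:</b>(.+?)</td>");

    // Hypothetical helper: returns the captured text, or null if nothing matched.
    static String extractRenewalDate(String html) {
        Matcher m = RENEWAL.matcher(html);
        // find() looks for the pattern anywhere in the input.
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Made-up sample input standing in for returnedHTMLPage.
        String html = "<tr><td><b>Renewal Date:</b> 03 May 2010</td><td>other</td></tr>";
        System.out.println(extractRenewalDate(html)); // prints "03 May 2010"
    }
}
```

The trim() call drops the space that the page has between </b> and the date; leave it out if you need the raw captured text.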

aioobe answered Apr 16 '26 13:04


On matches vs find

The problem is that you used matches when you should've used find. From the API:

  • The matches method attempts to match the entire input sequence against the pattern.
  • The find method scans the input sequence looking for the next subsequence that matches the pattern.

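Note that String.matches(String regex) also requires a full match of the entire string. Unfortunately String does not provide a partial regex match, but you can always use s.matches(".*pattern.*") instead. Keep in mind that . does not match line terminators by default, so for multi-line input such as an HTML page you need the DOTALL flag, e.g. s.matches("(?s).*pattern.*").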
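A short sketch of the difference, using an invented date pattern and invented sample strings (the class and method names are hypothetical too):

```java
import java.util.regex.Pattern;

public class MatchesVsFind {
    // Illustrative pattern: two digits, a word, four digits.
    static final Pattern DATE = Pattern.compile("\\d{2} \\w+ \\d{4}");

    // True only if the whole input is a date.
    static boolean wholeInputMatches(String s) {
        return DATE.matcher(s).matches();
    }

    // True if a date occurs anywhere in the input.
    static boolean containsMatch(String s) {
        return DATE.matcher(s).find();
    }

    public static void main(String[] args) {
        String s = "Renewal Date: 03 May 2010";
        System.out.println(wholeInputMatches(s));             // false
        System.out.println(containsMatch(s));                 // true
        System.out.println(wholeInputMatches("03 May 2010")); // true

        // String.matches behaves like Matcher.matches, so wrap the pattern in .*
        System.out.println(s.matches(".*\\d{2} \\w+ \\d{4}.*")); // true

        // Without DOTALL, . stops at the line break, so the full match fails.
        String twoLines = "line one\n03 May 2010";
        System.out.println(twoLines.matches(".*\\d{2} \\w+ \\d{4}.*"));     // false
        System.out.println(twoLines.matches("(?s).*\\d{2} \\w+ \\d{4}.*")); // true
    }
}
```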


On reluctant quantifier

Java understands (.+?) perfectly.

Here's a demonstration: given a string s that consists of some string t repeated at least twice, find t.

System.out.println("hahahaha".replaceAll("^(.+)\\1+$", "($1)"));
// prints "(haha)" -- greedy takes longest possible

System.out.println("hahahaha".replaceAll("^(.+?)\\1+$", "($1)"));
// prints "(ha)" -- reluctant takes shortest possible
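The same effect shows up with the pattern from the question. The sketch below uses a made-up table row with two cells; the class name and the firstGroup helper are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Returns group 1 of the first match, or null if there is none.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Invented sample row: the date cell is followed by another cell.
        String row = "<td><b>Renewal Date:</b> 03 May 2010</td><td>Active</td>";

        System.out.println(firstGroup("<b>Renewal Date:</b>(.+)</td>", row));
        // prints " 03 May 2010</td><td>Active" -- greedy runs to the last </td>

        System.out.println(firstGroup("<b>Renewal Date:</b>(.+?)</td>", row));
        // prints " 03 May 2010" -- reluctant stops at the first </td>
    }
}
```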

On escaping metacharacters

It should also be said that you have injected \ into your regex ("\\" as Java string literal) unnecessarily.

        String regexDate = "<b>Expiry Date:<\\/b>(.+?)<\\/td>";
                                            ^^         ^^
        Pattern p2 = Pattern.compile("<b>Expiry Date:<\\/b>");
                                                      ^^

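\ is used to escape regex metacharacters. A / is NOT a regex metacharacter. It had to be escaped in JavaScript only because / delimits a regex literal there; a Java regex is an ordinary string, so the slash needs no escaping.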
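A quick check of this, as a sketch with invented sample strings (the class name and the found helper are hypothetical). It also shows what escaping looks like for a character that really is a metacharacter, using the standard Pattern.quote method:

```java
import java.util.regex.Pattern;

public class SlashEscapeDemo {
    // True if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String html = "<b>Expiry Date:</b> 03 May 2010</td>";

        // Escaping / is harmless but pointless: both patterns match the same text.
        System.out.println(found("<b>Expiry Date:<\\/b>", html)); // true
        System.out.println(found("<b>Expiry Date:</b>", html));   // true

        // A real metacharacter behaves differently when left unescaped.
        System.out.println(found("1.5", "125"));                // true, . matches any character
        System.out.println(found("1\\.5", "125"));              // false, literal dot required
        System.out.println(found(Pattern.quote("1.5"), "125")); // false, whole string taken literally
    }
}
```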

See also

  • Regular expressions and escaping special characters

polygenelubricants answered Apr 16 '26 14:04


