
Regex: ignoring order of groups

Tags: java, regex

I have a piece of text:

randomtext 1150,25 USD randomtext

and a simple regex to extract the amount of money in different currencies:

(((\d+)(,?\s?|.)(\d{1,2}))\s?(PLN|EUR|USD|CHF|GBP))

Which gives me these groups:

  1. 1150,25 USD
  2. 1150,25
  3. 1150
  4. ,
  5. 25
  6. USD

However, the number and the currency may swap their positions:

randomtext USD 1150,25 randomtext

or

randomtext USD1150,25 randomtext

How should I improve my regex to satisfy that condition without repeating whole groups (AB|BA) while keeping the current grouping?

asked Aug 29 '15 14:08 by EyesClear


1 Answer

You can use this kind of pattern:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String p = "\\b (?=[\\dPEUCG])  # quickly skip positions that cannot start a match \n" +
           "(?=     # open a lookahead                                           \n" +
           "    (?> [\\d,]+ \\s* )? # the value may come before the currency     \n" +
           "    (?<currency> PLN|EUR|USD|CHF|GBP )  # capture the currency       \n" +
           "    (?:\\b|\\d) # a word boundary or a digit                         \n" +
           ")       # close the lookahead                                        \n" +
           "(?> [B-HLNPRSU]{3} \\s* )? (?<value> \\d+(?:,\\d+)? )                  ";

// Pattern.COMMENTS allows whitespace and # comments inside the pattern.
Pattern regComp = Pattern.compile(p, Pattern.COMMENTS);

String s = "USD 1150,25 randomtext \n" +
           "Non works randomtext 1150,25 USD randomtext\n" +
           "Works randomtextUSD 1150,25 USD randomtext\n" +
           "Works randomtext USD 1150,25 randomtext\n" +
           "Works randomtext USD1150,25 randomtext\n" +
           "Non work randomtext 1150,25 USD randomtext";

Matcher m = regComp.matcher(s);

// Each match exposes the amount and the currency through the named groups.
while (m.find()) {
    System.out.println(m.group("value") + " : " + m.group("currency"));
}

The idea is to capture the currency inside a lookahead, which is a zero-width assertion: it checks what follows without consuming any characters. The subpattern inside the lookahead allows for an optional value before the currency, so the currency is captured whether it comes before or after the number. The value itself is captured outside of the lookahead.
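
To see the lookahead trick in isolation, here is a minimal sketch with a deliberately reduced pattern. It assumes only two currencies (USD and EUR) and integer amounts; the class name LookaheadCaptureDemo and the method extract are hypothetical names chosen for the illustration, not part of the pattern above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaheadCaptureDemo {
    // Simplified illustration (assumption: USD/EUR only, integer amounts).
    // The lookahead captures the currency whether it comes before or after
    // the number; the number is then consumed outside the lookahead.
    private static final Pattern AMOUNT = Pattern.compile(
        "\\b(?=(?:\\d+\\s*)?(?<currency>USD|EUR))"   // find the currency without consuming
        + "(?:(?:USD|EUR)\\s*)?(?<value>\\d+)");      // skip a leading currency, capture the number

    // Returns "<value> <currency>" for the first match, or null if there is none.
    static String extract(String text) {
        Matcher m = AMOUNT.matcher(text);
        return m.find() ? m.group("value") + " " + m.group("currency") : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("pay 40 EUR now"));  // prints "40 EUR"
        System.out.println(extract("pay EUR 40 now"));  // prints "40 EUR"
        System.out.println(extract("pay EUR40 now"));   // prints "40 EUR"
    }
}
```

All three orderings give the same two named groups, because the group set inside the lookahead keeps its value after the assertion succeeds.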

About \\b (?=[\\dPEUCG]): this subpattern quickly discards every position in the string that is not the start of a word beginning with a digit or with the first letter of one of the currencies (P, E, U, C or G), so that the whole pattern does not have to be tested at those positions.
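
The effect of that prefilter can be checked on its own. The sketch below lists the words that begin at a position the prefilter accepts; the class name PrefilterDemo, the method candidates and the sample sentence are assumptions made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PrefilterDemo {
    // Only the prefilter from the answer: a word boundary followed by a digit
    // or by one of the first letters of the currencies.
    private static final Pattern PREFILTER = Pattern.compile("\\b(?=[\\dPEUCG])");

    // Returns the whitespace-delimited words starting at each accepted position.
    static List<String> candidates(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = PREFILTER.matcher(text);
        while (m.find()) {                       // zero-width matches, one per accepted position
            int end = m.start();
            while (end < text.length() && !Character.isWhitespace(text.charAt(end))) {
                end++;
            }
            words.add(text.substring(m.start(), end));
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [USD, 1150, Ubuntu, Git]
        System.out.println(candidates("randomtext USD 1150 for Ubuntu and Git"));
    }
}
```

Words such as "Ubuntu" and "Git" still pass this cheap first-character check; they are only rejected afterwards, when the rest of the pattern fails to find a currency and a value.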

answered Sep 20 '22 17:09 by Casimir et Hippolyte