 

A specific question about regular expression related to fixed width negative look-behind

Tags:

java

regex

I am writing a regular expression (for use in Java, if that is important) that attempts to match the number (which could be a float) after a $ (spaces are allowed between the $ and the number), but only if the word immediately preceding it is not the word 'LOST'.

If there are multiple possible matches, the first number should be returned.

For simplicity, assume all are upper-case.

For example, in the following sentence: "I PAID $10.12 FOR THE BEER", 10.12 will be matched. For the sentence "I LOST $11.34 IN THE GAME", there will not be a match. For "I LOST $11.34 IN THE GAME AND PAID $10.12 FOR THE BEER", 10.12 will still be matched.

The regular expression I came up with is

.*?(?<!LOST )[$]\s*(?<NUMBER>[0-9]*[.]?[0-9]*).*

My regular expression generally works fine, though I wonder whether there is an easier way to write it or whether I am missing any corner cases. One slight issue: if there is more than one whitespace character between LOST and $, I still do not want a match, but my current regular expression will match. Unfortunately, a negative look-behind has to have fixed width.
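For reference, here is a minimal Java sketch of how the pattern above behaves on the example sentences and on the multi-space case. The class name `QuestionRegexDemo` and the helper `firstAmount` are made up for illustration, and applying the pattern with `matches()` (rather than `find()`) is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuestionRegexDemo {
    // The pattern from the question, escaped for a Java string literal.
    static final Pattern PATTERN = Pattern.compile(
            ".*?(?<!LOST )[$]\\s*(?<NUMBER>[0-9]*[.]?[0-9]*).*");

    // Hypothetical helper: returns the NUMBER group,
    // or null when the whole input does not match.
    static String firstAmount(String input) {
        Matcher m = PATTERN.matcher(input);
        return m.matches() ? m.group("NUMBER") : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "I PAID $10.12 FOR THE BEER",
            "I LOST $11.34 IN THE GAME",
            "I LOST $11.34 IN THE GAME AND PAID $10.12 FOR THE BEER",
            // Two spaces: the five characters before '$' are no longer "LOST ".
            "I LOST  $11.34 IN THE GAME"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + firstAmount(s));
        }
    }
}
```

The last sample is the problematic one: with two spaces the look-behind no longer sees "LOST ", so 11.34 is captured even though it should be rejected.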

Clarification: when I say "the word immediately preceding it is not the word 'LOST'", I mean that '$' cannot be preceded by 'LOST\s*'. This means that neither 'LOST $123' nor 'LOST  $123' (two spaces) should match 123, but 'LOST! $123' can match 123. The rationale is that the currency should not be directly 'acted on' by LOST; if there is anything other than \s between LOST and $, then there is a good chance the currency is not directly 'acted on' by LOST.

Student asked Oct 26 '25 09:10


2 Answers

Let's say there could be between one and 99 whitespace characters between "LOST" and the dollar sign. Java only requires a look-behind to have a bounded maximum length, so a finite quantifier such as \s{1,99} is accepted. I've also assumed the number has two decimal digits, and there are no commas that serve as thousands separators. Then one could attempt to match the string with the regular expression

(?<!\bLOST\s{1,99})\$(?<NUMBER>(?:0|[1-9]\d*)\.\d{2})\b

If there is a match, the capture group named NUMBER contains the dollar amount of interest.

Demo

Hover the cursor over the regular expression at the link to obtain explanations of each element of the expression.
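A minimal Java sketch of this first approach follows, assuming the pattern is applied with `find()` so that the first qualifying amount is returned. The class name `BoundedLookbehindDemo` and the helper `firstAmount` are hypothetical names introduced for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundedLookbehindDemo {
    // The expression above, escaped for a Java string literal.
    // The look-behind is variable-width but bounded (at most 99 whitespace characters).
    static final Pattern PATTERN = Pattern.compile(
            "(?<!\\bLOST\\s{1,99})\\$(?<NUMBER>(?:0|[1-9]\\d*)\\.\\d{2})\\b");

    // Hypothetical helper: returns the first amount not preceded by
    // "LOST" plus whitespace, or null if there is none.
    static String firstAmount(String input) {
        Matcher m = PATTERN.matcher(input);
        return m.find() ? m.group("NUMBER") : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "I PAID $10.12 FOR THE BEER",
            "I LOST $11.34 IN THE GAME",
            "I LOST    $11.34 IN THE GAME", // several spaces are still rejected
            "I LOST $11.34 IN THE GAME AND PAID $10.12 FOR THE BEER"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + firstAmount(s));
        }
    }
}
```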



Another way is to attempt to match the regular expression

\bLOST\s+\$(?:0|[1-9]\d*)\.\d{2}\b|\$(?<NUMBER>(?:0|[1-9]\d*)\.\d{2})\b

Demo

In this case, ignore matches that do not capture and use only those that do; for the latter, the capture group NUMBER contains the monetary value of interest. Here

\bLOST\s+\$(?:0|[1-9]\d*)\.\d{2}\b

matches, but does not capture, values preceded by "LOST" followed by one or more whitespace characters followed by a dollar sign. One might say it gobbles up such substrings.
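In Java, the "ignore matches that do not capture" step could be sketched as a loop over `find()` that skips any match whose NUMBER group is null. The class name `AlternationDemo` and the helper `firstAmount` are hypothetical; this is one possible way to consume the pattern, not the only one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationDemo {
    // First alternative gobbles "LOST <spaces> $amount" without capturing;
    // second alternative captures any remaining amount in NUMBER.
    static final Pattern PATTERN = Pattern.compile(
            "\\bLOST\\s+\\$(?:0|[1-9]\\d*)\\.\\d{2}\\b"
            + "|\\$(?<NUMBER>(?:0|[1-9]\\d*)\\.\\d{2})\\b");

    // Hypothetical helper: returns the first captured amount, or null if none.
    static String firstAmount(String input) {
        Matcher m = PATTERN.matcher(input);
        while (m.find()) {
            // group("NUMBER") is null when the first alternative matched.
            String number = m.group("NUMBER");
            if (number != null) {
                return number;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstAmount(
                "I LOST $11.34 IN THE GAME AND PAID $10.12 FOR THE BEER"));
        System.out.println(firstAmount("I LOST   $11.34 IN THE GAME"));
    }
}
```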

Cary Swoveland answered Oct 28 '25 22:10


Inspired by blhsing's answer, I propose this regex that may look cleaner and have wider edge case coverage:

(?:^|(?<!LOST)\b)\W*\$(?<NUMBER>\d+(?:\.\d+)?)

Because Java does not allow an unbounded-width lookbehind, the position where you put your lookbehind is crucial.

  1. Since you don't want the word before the currency to be LOST, first match a word boundary:
\b
  2. Then make sure that the word is not LOST:
(?<!LOST)\b
  3. After that, put your currency match after it, preceded by optional non-word characters:
(?<!LOST)\b\W*\$(?<NUMBER>\d+(?:\.\d+)?)
  4. Finally, handle edge cases such as a string that starts with the currency:
(?:^|(?<!LOST)\b)\W*\$(?<NUMBER>\d+(?:\.\d+)?)

See the test cases
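The difference between steps 3 and 4 can be checked with a small Java sketch. The class name `BoundaryLookbehindDemo`, the constants `STEP3` and `STEP4`, and the helper `firstAmount` are hypothetical names chosen for illustration; the sample string "$5 FOO" is likewise made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryLookbehindDemo {
    // Step 3: fixed-width look-behind placed at the word boundary.
    static final Pattern STEP3 = Pattern.compile(
            "(?<!LOST)\\b\\W*\\$(?<NUMBER>\\d+(?:\\.\\d+)?)");

    // Step 4: additionally accepts a currency at the very start of the string.
    static final Pattern STEP4 = Pattern.compile(
            "(?:^|(?<!LOST)\\b)\\W*\\$(?<NUMBER>\\d+(?:\\.\\d+)?)");

    // Hypothetical helper: returns the first NUMBER found by the pattern, or null.
    static String firstAmount(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group("NUMBER") : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "I PAID $10.12 FOR THE BEER",
            "I LOST $11.34 IN THE GAME",
            "I LOST  $11.34 IN THE GAME", // two spaces
            "I LOST $11.34 IN THE GAME AND PAID $10.12 FOR THE BEER",
            "$5 FOO" // starts with the currency: no word boundary before '$'
        };
        for (String s : samples) {
            System.out.println(s + " -> step 3: " + firstAmount(STEP3, s)
                    + ", step 4: " + firstAmount(STEP4, s));
        }
    }
}
```

Because `\W*` absorbs any run of whitespace after the word boundary, the number of spaces between LOST and $ does not matter here.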

Hao Wu answered Oct 28 '25 22:10


