A specific question about regular expression related to fixed width negative look-behind

Question

I am writing a regular expression (for use in Java, if that is important) that attempts to match the number (possibly a float) after a $ (spaces are allowed between the $ and the number), but only if the word immediately preceding it is not the word 'LOST'.

If there are multiple possible matches, the first number should be returned.

For simplicity, assume all are upper-case.

For example, in the following sentence: "I PAID $10.12 FOR THE BEER", 10.12 will be matched. For the sentence "I LOST $11.34 IN THE GAME", there will not be a match. For "I LOST $11.34 IN THE GAME AND PAID $10.12 FOR THE BEER", 10.12 will still be matched.

The regular expression I came up with is

.*?(?<!LOST )[$]\s*(?<NUMBER>[0-9]*[.]?[0-9]*).*

My regular expression generally works fine, though I wonder whether there is an easier way to write it or whether I am missing any corner cases. One slight issue: if there is more than one whitespace character between LOST and $, I still do not want a match, but my current regular expression will match. Unfortunately, a negative look-behind has to have a fixed width.
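The multi-space issue described above can be reproduced with a short Java sketch. The class name `FixedWidthLookbehindDemo` and the helper `firstAmount` are hypothetical names chosen for illustration; the pattern itself is the one from the question, unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern comes from the question.
class FixedWidthLookbehindDemo {
    // The look-behind covers "LOST" plus exactly ONE space, so it only
    // inspects the five characters immediately before the '$'.
    static final Pattern QUESTION_PATTERN = Pattern.compile(
            ".*?(?<!LOST )[$]\\s*(?<NUMBER>[0-9]*[.]?[0-9]*).*");

    // Returns the captured number, or null when the whole input does not match.
    static String firstAmount(String input) {
        Matcher m = QUESTION_PATTERN.matcher(input);
        return m.matches() ? m.group("NUMBER") : null;
    }

    public static void main(String[] args) {
        // Matches as intended.
        System.out.println(firstAmount("I PAID $10.12 FOR THE BEER"));
        // Rejected as intended: "LOST " sits directly before the '$'.
        System.out.println(firstAmount("I LOST $11.34 IN THE GAME"));
        // Two spaces after LOST: the five characters before '$' are "OST  ",
        // so the look-behind no longer blocks the match.
        System.out.println(firstAmount("I LOST  $11.34 IN THE GAME"));
    }
}
```

The third call shows the unwanted match: a look-behind of fixed length cannot account for a varying number of spaces.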

Clarification: When I say "the word immediately preceding it is not the word 'LOST'", I mean that '$' cannot be preceded by 'LOST\s*'. This means that neither 'LOST $123' nor 'LOST    $123' (with several spaces) should match 123, but 'LOST! $123' can match 123. The rationale is that the currency should not be directly 'acted on' by LOST; if there is anything other than \s between LOST and $, then there is a good chance the currency is not directly 'acted on' by LOST.

Cary Swoveland · Accepted Answer

Let's say there could be between one and 99 whitespace characters between "LOST" and the dollar sign. I've also assumed that the number has exactly two decimal digits and that there are no commas serving as thousands separators. Then one could attempt to match the string with the regular expression

(?<!\bLOST\s{1,99})\$(?<NUMBER>(?:0|[1-9]\d*)\.\d{2})\b

If there is a match, the capture group named NUMBER will contain the dollar amount of interest.

Demo

Hover the cursor over the regular expression at the link to obtain explanations of each element of the expression.
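As a minimal sketch of how this first expression might be used from Java, the following wraps it in a helper. The class name `BoundedLookbehindDemo` and the method `firstAmount` are hypothetical; the pattern is the one given above. Java accepts this look-behind because `\s{1,99}` has a known maximum length.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern comes from the answer.
class BoundedLookbehindDemo {
    // Negative look-behind with a bounded quantifier: 1 to 99 whitespace
    // characters may separate LOST from the dollar sign.
    static final Pattern AMOUNT = Pattern.compile(
            "(?<!\\bLOST\\s{1,99})\\$(?<NUMBER>(?:0|[1-9]\\d*)\\.\\d{2})\\b");

    // Returns the first amount not preceded by LOST, or null if there is none.
    static String firstAmount(String input) {
        Matcher m = AMOUNT.matcher(input);
        return m.find() ? m.group("NUMBER") : null;
    }

    public static void main(String[] args) {
        System.out.println(firstAmount("I PAID $10.12 FOR THE BEER"));     // 10.12
        System.out.println(firstAmount("I LOST $11.34 IN THE GAME"));      // null
        System.out.println(firstAmount("I LOST    $11.34 IN THE GAME"));   // null
        System.out.println(firstAmount(
                "I LOST $11.34 IN THE GAME AND PAID $10.12 FOR THE BEER")); // 10.12
    }
}
```

Because `find()` scans left to right, the first qualifying amount is the one returned.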

Another way is to attempt to match the regular expression

\bLOST\s+\$(?:0|[1-9]\d*)\.\d{2}\b|\$(?<NUMBER>(?:0|[1-9]\d*)\.\d{2})\b

Demo

In this case, ignore matches that do not capture and consider only those that do; for those, the capture group NUMBER will contain the monetary value of interest. Here

\bLOST\s+\$(?:0|[1-9]\d*)\.\d{2}\b

matches, but does not capture, values preceded by "LOST" followed by one or more whitespace characters followed by a dollar sign. One might say it gobbles up such substrings.
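The "ignore matches that do not capture" step can be sketched in Java as a loop over `find()` that skips any match whose NUMBER group is null. The class name `AlternationSkipDemo` and the method `firstAmount` are hypothetical; the pattern is the alternation shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern comes from the answer.
class AlternationSkipDemo {
    // First alternative consumes "LOST <spaces> $amount" without capturing;
    // second alternative captures any remaining "$amount" in NUMBER.
    static final Pattern AMOUNT = Pattern.compile(
            "\\bLOST\\s+\\$(?:0|[1-9]\\d*)\\.\\d{2}\\b"
            + "|\\$(?<NUMBER>(?:0|[1-9]\\d*)\\.\\d{2})\\b");

    // Returns the first captured amount, or null if every match was consumed
    // by the non-capturing LOST alternative (or there was no match at all).
    static String firstAmount(String input) {
        Matcher m = AMOUNT.matcher(input);
        while (m.find()) {
            String number = m.group("NUMBER"); // null when the LOST branch matched
            if (number != null) {
                return number;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstAmount(
                "I LOST $11.34 IN THE GAME AND PAID $10.12 FOR THE BEER")); // 10.12
        System.out.println(firstAmount("I LOST    $11.34 IN THE GAME"));   // null
    }
}
```

This variant needs no look-behind, so the number of whitespace characters after LOST is not limited to 99.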

Hao Wu · Answer

Inspired by blhsing's answer, I propose this regex that may look cleaner and have wider edge case coverage:

(?:^|(?<!LOST)\b)\W*\$(?<NUMBER>\d+(?:\.\d+)?)

Because Java cannot have a lookbehind of unbounded width, the position where you put your lookbehind is crucial.

Since you don't want the word before the currency to be LOST, you can match a word boundary first:

\b

Then you need to make sure that the word is not LOST:

(?<!LOST)\b

After that, place the currency match after it, preceded by optional non-word characters:

(?<!LOST)\b\W*\$(?<NUMBER>\d+(?:\.\d+)?)

Then handle edge cases, such as the string starting with the currency:

(?:^|(?<!LOST)\b)\W*\$(?<NUMBER>\d+(?:\.\d+)?)

See the test cases
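The final expression from the steps above can be tried out with a small Java sketch. The class name `WordBoundaryLookbehindDemo` and the method `firstAmount` are hypothetical; the pattern is the one built up in this answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern comes from the answer.
class WordBoundaryLookbehindDemo {
    // Start either at the beginning of the input or at a word boundary that is
    // not the end of "LOST"; then skip non-word characters up to the '$'.
    static final Pattern AMOUNT = Pattern.compile(
            "(?:^|(?<!LOST)\\b)\\W*\\$(?<NUMBER>\\d+(?:\\.\\d+)?)");

    // Returns the first qualifying amount, or null if there is none.
    static String firstAmount(String input) {
        Matcher m = AMOUNT.matcher(input);
        return m.find() ? m.group("NUMBER") : null;
    }

    public static void main(String[] args) {
        System.out.println(firstAmount("I PAID $10.12 FOR THE BEER"));   // 10.12
        System.out.println(firstAmount("I LOST    $11.34 IN THE GAME")); // null
        System.out.println(firstAmount("$5 FOR YOU"));                   // 5
        System.out.println(firstAmount("LOST! $123"));                   // null
    }
}
```

Since `\W*` also skips punctuation, this pattern rejects an amount whenever the preceding word is LOST, so "LOST! $123" yields no match here.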

Tags: java, regex
