I've got the following string: 'USD 100'
Based on this post I'm trying to capture 100 if USD is contained in the string, or the individual (currency) characters if USD is not contained in the string.
For example:
'USD 100' # => '100'
'YEN 300' # => ['Y', 'E', 'N']
So far I've got up to this but it's not working:
https://rubular.com/r/cK8Hn2mzrheHXZ
Interestingly, if I place the USD after the amount it seems to work. Ideally I'd like to have the same behaviour regardless of the position of the currency characters.
Your regex (?=.*(USD))(?(1)\d+|[a-zA-Z]) does not work because:
(?=.*(USD)) - a positive lookahead, triggered at every location inside the string (if scan is used), that matches a USD substring after any 0 or more chars other than line break chars, as many as possible (this means there will only be a match if there is USD somewhere on the line, to the right of the current location)
(?(1)\d+|[a-zA-Z]) - a conditional construct that matches 1+ digits if Group 1 matched (if there is USD); otherwise, an ASCII letter will be tried. However, the second alternative pattern will never be tried, because you required USD to be present in the string for a match to occur.
Look at the USD 100 regex debugger; it shows exactly what happens when the (?=.*(USD))(?(1)\d+|[a-zA-Z]) regex tries to find a match:
- USD is found at the start of the string (the first time the pattern is tried, the regex index is at the string start position). The lookahead finds a match.
- The (?(1) condition is met: Group 1, USD, was matched. So the first, "then", part is triggered. \d+ does not find any digits, since there is a U letter at the start. The match fails at the string start position, but there are more positions in the string to test, since there is no \A or ^ anchor that would only allow a match at the start of the string/line.
- The engine then moves on to the next position, before S.
- The lookahead tries to find USD immediately to the right of the current location, but fails (U is already "behind" the index).
- It then tries to find USD anywhere to the right of the current location and eventually fails.
If USD is somewhere to the right of 100, then you'd get a match.
So, the lookahead does not set any search range; it simply allows matching the rest of the pattern (if its own pattern matches) or not (if its own pattern is not found).
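The behaviour described above can be checked directly in Ruby. This is only a minimal sketch: the method name whole_match is made up for illustration, and String#[] is used instead of scan so that the whole match is returned rather than the contents of Group 1.

```ruby
# Hypothetical helper: returns the whole match of the question's pattern,
# or nil when the pattern does not match anywhere in the string.
def whole_match(str)
  str[/(?=.*(USD))(?(1)\d+|[a-zA-Z])/]
end

p whole_match("USD 100")  # => nil
p whole_match("100 USD")  # => "100"
p whole_match("YEN 300")  # => nil
```

Only the string with USD to the right of the digits produces a match, which is consistent with the lookahead acting as a yes/no gate rather than a search range.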
You may use
.scan(/^USD.*?\K(\d+)|([a-zA-Z])/).flatten.compact
Pattern details
^USD.*?\K(\d+) - either USD at the start of a line (in Ruby, ^ matches at line starts), then any 0 or more chars other than line break chars, as few as possible, and then the text matched so far is dropped and 1+ digits are captured into Group 1
| - or
([a-zA-Z]) - any ASCII letter captured into Group 2.
See the Ruby demo:
p "USD 100".scan(/^USD.*?\K(\d+)|([a-zA-Z])/).flatten.compact
# => ["100"]
p "YEN 100".scan(/^USD.*?\K(\d+)|([a-zA-Z])/).flatten.compact
# => ["Y", "E", "N"]
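The question also asks for the same behaviour regardless of where USD appears. One possible way to get that, swapping the single regex for plain Ruby branching, is sketched below. The method name currency_parts is hypothetical, and the sketch assumes that "contains USD" can be tested with String#include? and that every run of digits should be returned.

```ruby
# Hypothetical helper, not part of the answer above: branch in Ruby
# instead of inside the regex, so the position of USD does not matter.
def currency_parts(str)
  if str.include?("USD")
    str.scan(/\d+/)        # every run of digits
  else
    str.scan(/[a-zA-Z]/)   # every ASCII letter, one per element
  end
end

p currency_parts("USD 100")  # => ["100"]
p currency_parts("100 USD")  # => ["100"]
p currency_parts("YEN 300")  # => ["Y", "E", "N"]
```

This trades the one-liner for two simple patterns, which may be easier to maintain if more currencies need special handling later.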
Anatomy of your pattern
(?=.*(USD))(?(1)\d+|[a-zA-Z])
|    |     | |  |   |_______
|    |     | |  |           Else match a single char a-zA-Z
|    |     | |  |
|    |     | |  |__
|    |     | |      If group 1 exists, match 1+ digits
|    |     | |
|    |     | |__
|    |     |    Test for group 1
|    |     |_________________
|    |                       If Clause
|    |___
|        Capture group 1
|__________
           Positive lookahead
About the pattern you tried
The positive lookahead is not anchored and will be tried at each position. If it returns true, the match continues; otherwise the match attempt stops and the engine moves to the next position.
Why does the pattern not match?
At the first position the lookahead is true, as it can find USD on the right.
It then tries to match 1+ digits, but the first char is U, which it cannot match.
USD 100
⎸
First position
From the second position till the end, the lookahead is false because it can not find USD on the right.
USD 100
 ⎸
 Second position
In the end, the if clause is tried only once, at a position where it could not match 1+ digits. The else clause is never tried, and overall there is no match.
For the YEN 300 part, the if clause is never tried, as the lookahead never finds USD on the right, and overall there is no match.
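The position-by-position reasoning above can be reproduced with a small Ruby sketch. The method name lookahead_by_position is made up for this example, and slicing the string is only an emulation of the engine advancing its start position (it assumes Ruby 2.4+ for match?).

```ruby
# Hypothetical helper: for each start position, report whether the
# lookahead (?=.*USD) would succeed there. Slicing the string and
# anchoring with \A emulates "the engine is at position i", which is
# valid here because a lookahead only inspects text to the right.
def lookahead_by_position(str)
  (0...str.length).map { |i| str[i..-1].match?(/\A(?=.*USD)/) }
end

p lookahead_by_position("USD 100")
# => [true, false, false, false, false, false, false]
p lookahead_by_position("YEN 300")
# => [false, false, false, false, false, false, false]
p lookahead_by_position("100 USD")
# => [true, true, true, true, true, false, false]
```

For USD 100 the lookahead is true only at the first position, where \d+ cannot match, while for 100 USD it is true at every position up to and including the U, which is why the digits can be matched there.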
Useful resources about conditionals can be found at, for example, rexegg.com and regular-expressions.info
If you want the separate matches, you might use:
\bUSD \K\d+|[A-Z](?=[A-Z]* \d+\b)
Explanation
\bUSD - A word boundary, then match USD and a space
\K\d+ - Forget what is matched so far using \K and match 1+ digits
| - Or
[A-Z] - Match a char A-Z
(?=[A-Z]* \d+\b) - Assert that what is on the right is optional chars A-Z, a space and 1+ digits
Or using capturing groups:
\bUSD \K(\d+)|([A-Z])(?=[A-Z]* \d+\b)
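A short Ruby sketch of how both variants might be used with scan follows. The two method names are made up for this example; the patterns are copied from above. With capturing groups, scan returns one array per match, so flatten.compact is used to drop the nil entries.

```ruby
# Hypothetical helpers wrapping the two patterns from the answer.
def separate_matches(str)
  # No capture groups: scan returns the matched strings directly.
  str.scan(/\bUSD \K\d+|[A-Z](?=[A-Z]* \d+\b)/)
end

def separate_matches_with_groups(str)
  # With capture groups: scan returns [group1, group2] pairs.
  str.scan(/\bUSD \K(\d+)|([A-Z])(?=[A-Z]* \d+\b)/).flatten.compact
end

p separate_matches("USD 100")              # => ["100"]
p separate_matches("YEN 300")              # => ["Y", "E", "N"]
p separate_matches_with_groups("USD 100")  # => ["100"]
p separate_matches_with_groups("YEN 300")  # => ["Y", "E", "N"]
p separate_matches("100 USD")              # => []
```

The last call shows that these patterns expect the currency code before the amount; a string with the amount first yields no matches.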