I'm trying to extract the header and a 2- or 3-letter ISO 639 code from a string.
The general format of a valid string is:
header + <special char> + <2- or 3-letter code> + (<special char>forced)
The last section, <special char>forced, is optional and may or may not be present, but if it is present, forced must be preceded by a special character (like . or _ or -) for the string to be considered valid.
Examples of valid strings from which the header and language code (eng) are to be extracted:
name.eng
name-eng
name(eng)
name(fri)_eng
name(fri)(eng)
name.eng.forced
name(eng).forced
name.(eng).forced
name.fri.eng.forced
name(fri).eng.forced
name.(fri).eng_forced
name-fri-eng.forced
name_(fri)_eng.forced
name(fri)_eng.forced
name(friday)_eng_forced
name(fri)(eng).forced
The one extra check here is that if the language code has a ) after it, then it must have a ( before it. This is not critical, but it would be nice if the regex could check for it.
Examples of invalid strings are:
nameeng
nameeng.forced
name.eng).forced
name(fri)eng.forced
name(friday).engforced
name(fri)(eng)forced
What I came up with to check this is:
(.*)([._\-(])([a-z][a-z][a-z]|[a-z][a-z])((?<=\(...)\))?(.forced)?
I'm also trying to use a lookbehind to check for the ( before the language code if it has a ) after the code. Again, this is not critical and is not the core issue I'm facing.
The issue is that the header (and consequently the language code) is incorrect for some of the valid names, I think because the expression is too greedy (I'm using C#, and as far as I know there is no way to turn off greediness for all quantifiers at once). I've also tried the right-to-left option, but that didn't seem to work either, even after rearranging the expression.
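To illustrate the greediness problem, here is a minimal sketch that runs the pattern above against one of the valid names. The class name and the sample input are chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

class GreedyDemo
{
    static void Main()
    {
        // The pattern from above, unchanged.
        var pattern = new Regex(
            @"(.*)([._\-(])([a-z][a-z][a-z]|[a-z][a-z])((?<=\(...)\))?(.forced)?");

        Match m = pattern.Match("name.eng.forced");

        // Group 1 is meant to be the header, group 3 the language code.
        // The greedy (.*) runs to the last separator, so "for" (from "forced")
        // is picked up as the code instead of "eng".
        Console.WriteLine($"header='{m.Groups[1].Value}', code='{m.Groups[3].Value}'");
        // prints: header='name.eng', code='for'
    }
}
```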
Is it possible to achieve what I need from a Regex in C#?
Posting my suggestion since it turned out to be helpful:
^(.*?[._-]?)(?=[\W_])[._-]?(\()?([a-z]{2,3})(?(2)\)|)(?:[_\W]forced)?$
See the regex demo.
Details
^ - start of string
(.*?[._-]?) - Group 1: any 0+ chars other than newline, as few as possible, and then an optional ., _ or -
(?=[\W_])[._-]?(\()? - the next char must be a non-alphanumeric char (due to the (?=[\W_]) positive lookahead), then an optional ., - or _ is matched, and then an optional ( that is captured into Group 2
([a-z]{2,3}) - Group 3: 2 or 3 lowercase ASCII letters
(?(2)\)|) - a conditional construct: if Group 2 matched, match a ), else match an empty string
(?:[_\W]forced)? - an optional non-capturing group matching 1 or 0 occurrences of:
[_\W] - any non-alphanumeric char
forced - a literal substring
$ - end of string.
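As a rough usage sketch in C#, the pattern can be wrapped in a small helper that returns the header and the code. The helper name TryParse and the sample inputs are assumptions made for illustration. Since Group 1 may end with a separator (for example with name.(eng).forced), the sketch trims trailing ., _ and - from the header; drop that step if the separator should be kept.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    // The pattern from the answer, unchanged.
    static readonly Regex Pattern = new Regex(
        @"^(.*?[._-]?)(?=[\W_])[._-]?(\()?([a-z]{2,3})(?(2)\)|)(?:[_\W]forced)?$");

    // Hypothetical helper: returns true and fills header/code when the input is valid.
    static bool TryParse(string input, out string header, out string code)
    {
        Match m = Pattern.Match(input);
        // Group 1 is the header (may carry a trailing separator), Group 3 the code.
        header = m.Success ? m.Groups[1].Value.TrimEnd('.', '_', '-') : "";
        code = m.Success ? m.Groups[3].Value : "";
        return m.Success;
    }

    static void Main()
    {
        string[] samples =
        {
            "name.eng",                // valid
            "name(fri)_eng",           // valid
            "name.(eng).forced",       // valid
            "name(friday)_eng_forced", // valid
            "nameeng",                 // invalid: no separator before the code
            "name.eng).forced"         // invalid: ) without a matching (
        };

        foreach (string s in samples)
        {
            if (TryParse(s, out string header, out string code))
                Console.WriteLine($"{s}: header='{header}', code='{code}'");
            else
                Console.WriteLine($"{s}: no match");
        }
    }
}
```

With these samples, the four valid names should yield the code eng with headers name, name(fri), name and name(friday), while the last two inputs should report no match.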