Extracting codes with optional special characters from a string using Regex in C#

Tags: string, c#, regex

I'm trying to extract the header and a 2- or 3-letter ISO 639 code from a string.

The general format of a valid string is:

header + <special char> + <2 or 3 letter code> + (<special char>forced)

The last section, <special char>forced, is optional. If it is present, forced must be preceded by a special character (such as ., _ or -) for the string to be considered valid.

Examples of valid strings, from which the header and the language code (eng) should be extracted:

name.eng
name-eng
name(eng)
name(fri)_eng
name(fri)(eng)
name.eng.forced
name(eng).forced
name.(eng).forced
name.fri.eng.forced
name(fri).eng.forced
name.(fri).eng_forced
name-fri-eng.forced
name_(fri)_eng.forced
name(fri)_eng.forced
name(friday)_eng_forced
name(fri)(eng).forced

One additional check: if the language code is followed by a ), it must be preceded by a (. This is not critical, but it would be nice if the regex could check for it.

Examples of invalid strings are:

nameeng
nameeng.forced
name.eng).forced
name(fri)eng.forced
name(friday).engforced
name(fri)(eng)forced

What I came up with to check this is:

(.*)([._\-(])([a-z][a-z][a-z]|[a-z][a-z])((?<=\(...)\))?(.forced)?

I'm also trying to use a non-critical lookbehind to check for the ( before the language code when there is a ) after it. Again, this is nice to have and is not the core issue I'm facing.

The issue is that the header (and consequently the language code) is extracted incorrectly for some of the valid names. I think this is because the expression is too greedy (I'm using C#, and I found no way to turn off greediness for all quantifiers at once). I've also tried the right-to-left option, but that didn't seem to work either, even after rearranging the expression.

Is it possible to achieve what I need from a Regex in C#?

rboy asked Oct 18 '18 18:10


1 Answer

Posting my suggestion since it turned out to be helpful:

^(.*?[._-]?)(?=[\W_])[._-]?(\()?([a-z]{2,3})(?(2)\)|)(?:[_\W]forced)?$

See the regex demo.

Details

  • ^ - start of string
  • (.*?[._-]?) - Group 1: any 0+ chars, other than newline, as few as possible, and then an optional ., _ or -
  • (?=[\W_])[._-]?(\()? - the next char must be a non-alphanumeric char (due to the (?=[\W_]) positive lookahead), then an optional ., - or _ is matched, and then an optional ( is captured into Group 2
  • ([a-z]{2,3}) - 2 or 3 lowercase ASCII letters
  • (?(2)\)|) - a conditional construct: if Group 2 matched, match a ), else match an empty string
  • (?:[_\W]forced)? - an optional non-capturing group matching 1 or 0 occurrences of
    • [_\W] - any non-alphanumeric char
    • forced - a substring
  • $ - end of string.
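
As a rough illustration of how the pattern above might be used from C#, here is a minimal sketch. The class and method names (LanguageCodeParser, TryParse) are hypothetical and not part of the original answer. One thing to watch for: Group 1 can keep a trailing separator, for example "name." for name.(eng).forced, so this sketch trims ., _ and - from the end of the header. That trimming is an assumption about what the header should look like, not something the regex does by itself.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper wrapping the regex from the answer.
public static class LanguageCodeParser
{
    // The pattern is copied unchanged from the answer above.
    private static readonly Regex Pattern = new Regex(
        @"^(.*?[._-]?)(?=[\W_])[._-]?(\()?([a-z]{2,3})(?(2)\)|)(?:[_\W]forced)?$");

    // Returns true for a valid name and outputs the header and language code.
    // For an invalid name, returns false and outputs empty strings.
    public static bool TryParse(string name, out string header, out string code)
    {
        Match m = Pattern.Match(name);
        if (!m.Success)
        {
            header = string.Empty;
            code = string.Empty;
            return false;
        }

        // Assumption: a separator left at the end of Group 1 is not part of the header.
        header = m.Groups[1].Value.TrimEnd('.', '_', '-');
        // Group 3 holds the 2 or 3 letter code; Group 2 is only the optional "(".
        code = m.Groups[3].Value;
        return true;
    }
}
```

Calling LanguageCodeParser.TryParse("name(fri)_eng.forced", out var header, out var code) should return true with header "name(fri)" and code "eng", while an input such as "nameeng" should return false.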
Wiktor Stribiżew answered Sep 30 '22 18:09