
split string with regex using a release character and separators

Tags:

c#

regex

I need to parse an EDI file, where the separators are the +, : and ' signs and the escape (release) character is ?. You first split the data into segments:

var data = "NAD+UC+ABC2378::92++XYZ Corp.:Tel ?: ?+90 555 555 11 11:Mobile1?: ?+90 555 555 22 22:Mobile2?: ?+90 555 555 41 71+Duzce+Seferihisar / IZMIR++35460+TR";

var segments = data.Split('\'');

Then each segment is split into segment data elements by +, and each segment data element is split into component data elements by :.

var dataElements = segments[0].Split('+');

The above sample string is not parsed correctly because of the release character. I have special code dealing with this, but I think this should all be doable using

Regex.Split(data, separator);

I am not familiar with regexes and could not find a way to do this so far. The best I have come up with is

string[] lines = Regex.Split(data, @"[^?]\+");

which drops the character before each + sign:

NA
U
ABC2378::9
+XYZ Corp.:Tel ?: ?+90 555 555 11 11:Mobile1?: ?+90 555 555 22 22:Mobile2?: ?+90 555 555 41 7
Duzc
Seferihisar / IZMI
+3546
TR

The correct result should be:

NAD
UC
ABC2378::92

XYZ Corp.:Tel ?: ?+90 555 555 11 11:Mobile1?: ?+90 555 555 22 22:Mobile2?: ?+90 555 555 41 71
Duzce
Seferihisar / IZMIR

35460
TR

So the question is: is this doable with Regex.Split, and what should the separator regex look like?

hazimdikenli asked Aug 26 '13 11:08



2 Answers

I can see that you want to split around plus signs + only if they are not preceded (escaped) by a question mark ?. This can be done using the following:

(?<!\?)\+

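This matches a single + sign only if it is not immediately preceded by a question mark ?. The lookbehind (?<!\?) is zero-width, so unlike [^?]\+ it does not consume the character before the +.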
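As a rough sketch of how this pattern behaves on the sample string from the question, the snippet below passes it to Regex.Split. It is written as top-level statements (C# 9 or later), and the variable name elements is only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample segment from the question (it has no trailing ' terminator).
string data = "NAD+UC+ABC2378::92++XYZ Corp.:Tel ?: ?+90 555 555 11 11:Mobile1?: ?+90 555 555 22 22:Mobile2?: ?+90 555 555 41 71+Duzce+Seferihisar / IZMIR++35460+TR";

// Split on every '+' that is not immediately preceded by '?'.
// The lookbehind is zero-width, so no neighbouring character is lost.
string[] elements = Regex.Split(data, @"(?<!\?)\+");

// Brackets make the empty elements produced by "++" visible.
foreach (string element in elements)
{
    Console.WriteLine("[" + element + "]");
}
```

The two "++" runs each yield an empty element, and the escaped ?+ sequences stay inside the fifth element. The release characters themselves are not removed by the split.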

Edit: The problem with the previous expression is that it doesn't handle situations like ??+, ???+ or ????+; in other words, it doesn't handle cases where ?s are used to escape themselves.

We can solve this problem by noticing that if there is an odd number of ?s preceding a +, then the last one is definitely escaping the +, so we must not split; but if there is an even number of ?s before a +, they cancel each other out in pairs, leaving the + unescaped, so we should split around it.

From this observation, we need an expression that matches a + only if it is preceded by an even number (possibly zero) of question marks ?. Here it is:

(?<!(^|[^?])(\?\?)*\?)\+
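To make the difference between the two expressions concrete, here is a small sketch that runs both on a made-up input, "A??+B?+C+D", which is not from the question. The second pattern is the one above rewritten with non-capturing groups (?:...), a precaution on my part because Regex.Split adds any captured group text to its result.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: "??" is an escaped '?', "?+" is an escaped '+'.
string sample = "A??+B?+C+D";

// Simple lookbehind: treats every '+' that follows a '?' as escaped.
string[] simple = Regex.Split(sample, @"(?<!\?)\+");

// Even/odd-aware lookbehind: a '+' is escaped only when an odd number
// of '?' characters precedes it.
string[] evenAware = Regex.Split(sample, @"(?<!(?:^|[^?])(?:\?\?)*\?)\+");

Console.WriteLine(string.Join(" | ", simple));     // A??+B?+C | D
Console.WriteLine(string.Join(" | ", evenAware));  // A?? | B?+C | D
```

The simple pattern misses the split after "A??", while the even/odd-aware one splits there and still keeps "B?+C" together.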
Ibrahim Najjar answered Sep 29 '22 18:09


string[] lines = Regex.Split(data, @"\+"); 

Would this meet the requirement?

Edit: here is a version that does not split on a '+' that is escaped by a preceding '?'.

string[] lines = Regex.Split(data, @"(?<!\?)[\+]+"); 

The '+' at the end of the pattern makes a run of consecutive '+' separators match as a single separator. If you want an empty entry between consecutive separators instead, drop it:

string[] lines = Regex.Split(data, @"(?<!\?)[\+]"); 
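The difference between the two variants shows up on any input that contains "++". The following sketch uses the tail of the sample string from the question; the variable names are only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// Tail of the sample segment, containing two consecutive separators.
string tail = "Seferihisar / IZMIR++35460+TR";

// With the trailing quantifier, "++" is consumed as one separator,
// so the empty element between the two '+' signs disappears.
string[] collapsed = Regex.Split(tail, @"(?<!\?)[\+]+");

// Without it, each '+' is its own separator and "++" yields an empty entry.
string[] preserved = Regex.Split(tail, @"(?<!\?)[\+]");

Console.WriteLine(collapsed.Length);  // 3
Console.WriteLine(preserved.Length);  // 4
```

Since the expected output in the question keeps an empty line for "++", the variant without the quantifier appears to be the one that matches it.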
Irfan answered Sep 29 '22 19:09