
C# Regex.Split: Removing empty results

Tags: c#, regex, split

I am working on an application which imports thousands of lines where every line has a format like this:

|* 9070183020  |04.02.2011    |107222     |M/S SUNNY MEDICOS                  |GHAZIABAD                          |      32,768.00 |

I am using the following Regex to split each line into the data I need:

Regex lineSplitter = new Regex(@"(?:^\|\*|\|)\s*(.*?)\s+(?=\|)");
string[] columns = lineSplitter.Split(data);

foreach (string c in columns)
    Console.Write("[" + c + "] ");

This is giving me the following result:

[] [9070183020] [] [04.02.2011] [] [107222] [] [M/S SUNNY MEDICOS] [] [GHAZIABAD] [] [32,768.00] [|]

Now I have two questions.
1. How do I remove the empty results? I know I can use:

string[] columns = lineSplitter.Split(data).Where(s => !string.IsNullOrEmpty(s)).ToArray();

but is there any built-in method to remove the empty results?

2. How can I remove the last pipe?

Thanks for any help.
Regards,
Yogesh.

EDIT:
I think my question was a little misunderstood. It was never about how I can do it at all. It was only about how I can do it by changing the Regex in the above code.

I know that I can do it in many ways. I have already done it with the code mentioned above using a Where clause, and with an alternative approach that is also more than twice as fast:

Regex regex = new Regex(@"(^\|\*\s*)|(\s*\|\s*)");
data = regex.Replace(data, "|");

string[] columns = data.Split(new[] { '|' }, StringSplitOptions.RemoveEmptyEntries);

Secondly, as a test case, my system can parse 92k+ such lines in less than 1.5 seconds with the original method and in less than 700 milliseconds with the second method. In real cases I will never see more than a couple of thousand lines, so I don't think I need to worry about speed here. In my opinion, thinking about speed in this case is premature optimization.

I have found the answer to my first question: it cannot be done with Split as there is no such option built in.

I am still looking for an answer to my second question.

Yogesh asked Feb 06 '11 08:02


3 Answers

Regex lineSplitter = new Regex(@"[\s*\*]*\|[\s*\*]*");
var columns = lineSplitter.Split(data).Where(s => s != String.Empty);

or you could simply do:

string[] columns = data.Split(new char[] {'|'}, StringSplitOptions.RemoveEmptyEntries);
foreach (string c in columns)
    this.textBox1.Text += "[" + c.Trim(' ', '*') + "] " + "\r\n";

And no, Regex.Split has no option to remove empty entries the way String.Split does.

You can also use Regex.Matches and read the captured group from each match instead of splitting.
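A minimal sketch of the Matches approach, assuming the line format shown in the question. Regex.Split puts captured groups into its result alongside the text between matches. In the original pattern the matches sit back to back, so that in-between text is empty, and the final pipe is left over as a remainder. Collecting group 1 from each match avoids both. The class name PipeLineParser and the method ParseColumns are hypothetical, and the pattern is a slight variation of the one in the question (it uses [^|]*? and \s* so a field cannot run across a pipe), not the asker's exact regex.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PipeLineParser
{
    // Hypothetical helper, not from the question: one match per field.
    // The lazy [^|]*? lets the trailing \s* absorb the padding before the
    // next pipe; the lookahead leaves that pipe for the following match.
    private static readonly Regex FieldPattern =
        new Regex(@"(?:^\|\*|\|)\s*([^|]*?)\s*(?=\|)");

    public static string[] ParseColumns(string line)
    {
        return FieldPattern.Matches(line)
            .Cast<Match>()                    // MatchCollection is non-generic on older frameworks
            .Select(m => m.Groups[1].Value)   // group 1 holds the trimmed field text
            .ToArray();
    }

    public static void Main()
    {
        string data = "|* 9070183020  |04.02.2011    |107222     |M/S SUNNY MEDICOS                  |GHAZIABAD                          |      32,768.00 |";

        foreach (string c in ParseColumns(data))
            Console.Write("[" + c + "] ");
    }
}
```

The final pipe never starts a match because nothing follows it for the lookahead to find, so it simply drops out. A field that contains only spaces comes back as an empty string, which keeps the column positions stable; filter those out afterwards if they are not wanted.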

Jaroslav Jandek answered Sep 24 '22 18:09


Don't use a regex at all in your case. It doesn't seem you need one and regexes are much slower (and have a much higher overhead) than directly using the string functions.

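So use something like:

// An array cannot be declared const in C#; use a local or a static readonly field.
char[] splitChars = { '|' };

// The entries keep their padding (and the leading "* "), so trim them afterwards.
string[] splitData = data.Split(splitChars, StringSplitOptions.RemoveEmptyEntries);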
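The split above still leaves the column padding and the leading "* " in place. As a rough sketch of how the whole job might look with string functions only, assuming that spaces and asterisks never need to be preserved at the edges of a field (the names PlainSplitExample and SplitColumns are hypothetical):

```csharp
using System;
using System.Linq;

public static class PlainSplitExample
{
    // Characters stripped from both ends of every field.
    // Assumption: no real value starts or ends with a space or an asterisk.
    private static readonly char[] Padding = { ' ', '*' };

    public static string[] SplitColumns(string line)
    {
        return line.Split(new[] { '|' }, StringSplitOptions.RemoveEmptyEntries)
                   .Select(field => field.Trim(Padding))
                   .ToArray();
    }

    public static void Main()
    {
        string data = "|* 9070183020  |04.02.2011    |107222     |M/S SUNNY MEDICOS                  |GHAZIABAD                          |      32,768.00 |";

        Console.WriteLine(string.Join(";", SplitColumns(data)));
    }
}
```

RemoveEmptyEntries only drops the empty pieces before the first pipe and after the last one. A field made up solely of spaces survives the split and becomes an empty string after trimming, so add a Where filter if such fields should be discarded.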
Foxfire answered Sep 23 '22 18:09


I think this may work as an equivalent way to remove empty strings (note that this example splits on whitespace rather than on the pipes):

string[] splitter = Regex.Split(textvalue,@"\s").Where(s => s != String.Empty).ToArray<string>();
Peter answered Sep 21 '22 18:09