
C# Regex.Replace Multiple Newlines

Tags: c#, regex

I have a text file that contains what amount to paragraphs. The text is not actually words; it's comma-delimited data, but that's not really important. The file is divided into sections and subsections: sections are separated by more than one newline, and subsections by a single newline.

So sample data:

This is the, start of a, section
908690,246246246,246246
246246,246,246246

This is, the next, section,
sfhklj,sfhjk,4626246
4yw2,fdhds5juj,53ujj

So the above data contains two sections, each with three subsections. Sometimes, however, there is more than one empty line between sections. When this occurs, I want to convert the multiple newline characters, say \n\n\n\n, to just \n\n; I think regex is probably the way to do this. I may also need to handle different newline conventions, Unix \n and Windows \r\n. I think the files probably contain a mix of line endings.

Here is the regex that I've come up with; it's nothing special:

Regex.Replace(input, @"([\r\n|\n]{2,})", Environment.NewLine + Environment.NewLine)

Firstly, is this a good regex solution? I'm not that good with regex.

Secondly, I then want to split each section into an element in a string array:

Regex.Split(input, Environment.NewLine + Environment.NewLine)

Is there a way to combine these steps?

Shawn asked Oct 21 '10 23:10



1 Answer

[\r\n|\n] is wrong. That's a character class that matches one of the characters \r, \n, or |.
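To see the problem concretely, here is a minimal sketch that runs the question's pattern against a few inputs. The class and member names (CharClassDemo, QuestionPattern, Matches) are made up for illustration and are not part of any library.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class CharClassDemo
{
    // The pattern from the question: inside [...] the | is a literal
    // character, so this class means "\r or \n or |".
    public const string QuestionPattern = @"[\r\n|\n]{2,}";

    // Returns true if the pattern matches anywhere in the text.
    public static bool Matches(string text) =>
        Regex.IsMatch(text, QuestionPattern);
}
```

With this, CharClassDemo.Matches("a||b") is true, because two pipe characters satisfy the class. More importantly for your data, CharClassDemo.Matches("a\r\nb") is also true: a single Windows line break is two characters from the class, so the original Replace would turn an ordinary subsection break into a section break.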

Common idioms for matching a generic line separator are (?:\r\n|[\r\n]) or (?:\n|\r\n?). These will match \r\n (DOS/Windows), \r (older Macintosh), or \n (Unix/Linux/Mac OS X).
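As a quick check that the two idioms behave the same way on mixed input, the sketch below counts line separators with each one. LineSeparatorDemo and its members are hypothetical names chosen for this example.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class LineSeparatorDemo
{
    // Try \r\n first, then fall back to a lone \r or \n.
    public const string IdiomA = @"(?:\r\n|[\r\n])";

    // A lone \n, or \r optionally followed by \n.
    public const string IdiomB = @"(?:\n|\r\n?)";

    // Counts how many line separators the given pattern finds in the text.
    public static int Count(string text, string pattern) =>
        Regex.Matches(text, pattern).Count;
}
```

For the input "one\r\ntwo\nthree\rfour", both idioms should report three separators, because each treats \r\n as one unit rather than two.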

I would normalize all line separators to \n, then split on two or more of those:

Regex.Split(Regex.Replace(source, @"(?:\r\n|[\r\n])", "\n"), @"\n{2,}")
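Spelled out a little more, the same idea might look like the sketch below. SectionSplitter and its method names are hypothetical, and the Trim call is my own assumption: it is there so that leading or trailing blank lines do not produce empty array elements, which you may or may not want.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class SectionSplitter
{
    // Normalizes every line separator to "\n", then splits the text
    // into sections wherever two or more newlines occur in a row.
    public static string[] SplitSections(string source)
    {
        string normalized = Regex.Replace(source, @"\r\n|[\r\n]", "\n");

        // Assumption: blank lines at the very start or end of the file
        // should not yield empty sections, so strip them first.
        normalized = normalized.Trim('\n');

        return Regex.Split(normalized, @"\n{2,}");
    }

    // Splits one normalized section into subsections, one per line.
    public static string[] SplitSubsections(string section) =>
        section.Split('\n');
}
```

Because the split pattern already swallows any run of two or more newlines, the separate Replace that collapses \n\n\n\n to \n\n is no longer needed, which covers the "combine these steps" part of the question.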
Alan Moore answered Sep 30 '22 14:09