I have a text file that contains what are more or less paragraphs. The text is not actually words, it's comma-delimited data, but that's not really important. The file is divided into sections and subsections: sections are separated by more than one newline, and subsections by a single newline.
So sample data:
This is the, start of a, section
908690,246246246,246246
246246,246,246246

This is, the next, section,
sfhklj,sfhjk,4626246
4yw2,fdhds5juj,53ujj
So the above data contains two sections, each with three subsections. Sometimes, however, there is more than one empty line between sections. When this occurs, I want to convert the multiple newline characters, say \n\n\n\n, to just \n\n; I think regex is probably the way to do this. I also may need to handle different newline standards: Unix \n and Windows \r\n. I think the files probably contain a mix of line-ending encodings.
Here is the regex that I've come up with; it's nothing special:
Regex.Replace(input, @"([\r\n|\n]{2,})", Environment.NewLine + Environment.NewLine)
Firstly, is this a good regex solution? I'm not that good with regex.
Secondly, I then want to split each section into an element in a string array:
Regex.Split(input, Environment.NewLine + Environment.NewLine)
Is there a way to combine these steps?
[\r\n|\n] is wrong. That's a character class that matches one of the characters \r, \n, or |.
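To make the character-class problem concrete, here is a small sketch that runs the pattern from the question against a few inputs. The class name CharClassDemo and the helper Matches are made up for illustration; only Regex.IsMatch comes from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // The pattern from the question: inside [...] the | is a literal pipe,
    // so this is "two or more of \r, \n, or |", not an alternation.
    public const string Pattern = @"[\r\n|\n]{2,}";

    // Hypothetical helper: true if the pattern matches anywhere in text.
    public static bool Matches(string text) => Regex.IsMatch(text, Pattern);

    public static void Main()
    {
        Console.WriteLine(Matches("a||b"));   // True: two pipes satisfy the class
        Console.WriteLine(Matches("a\r\nb")); // True: one Windows line break is already two class characters
        Console.WriteLine(Matches("a\nb"));   // False: a single \n does not reach the {2,} minimum
    }
}
```

The second case is probably the one that matters most for the data in the question: with Windows line endings, every subsection break would be treated as a section break.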
Common idioms for matching a generic line separator are (?:\r\n|[\r\n]) or (?:\n|\r\n?). These will match \r\n (DOS/Windows), \r (older Macintosh), or \n (Unix/Linux/Mac OS X).
I would normalize all line separators to \n, then split on two or more of those:
Regex.Split(Regex.Replace(source, @"(?:\r\n|[\r\n])", "\n"), @"\n{2,}")
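As a fuller sketch of the same idea, the one-liner can be wrapped in a helper and tried on data shaped like the sample in the question. The names SectionSplitter and SplitSections are hypothetical, and the Trim call is an extra assumption on my part (it avoids an empty first or last element when the file starts or ends with line breaks); drop it if you want those empty entries.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SectionSplitter
{
    // Hypothetical helper wrapping the normalize-then-split approach.
    public static string[] SplitSections(string source)
    {
        // Step 1: turn \r\n, a lone \r, and a lone \n into a single \n.
        string normalized = Regex.Replace(source, @"\r\n|[\r\n]", "\n");
        // Step 2 (assumption, not in the one-liner): strip leading and trailing
        // line breaks so the split yields no empty section at either end.
        normalized = normalized.Trim('\n');
        // Step 3: two or more consecutive \n mark a section boundary.
        return Regex.Split(normalized, @"\n{2,}");
    }

    public static void Main()
    {
        // Mixed line endings and extra blank lines between the two sections.
        string input =
            "This is the, start of a, section\r\n908690,246246246,246246\r\n246246,246,246246\r\n\r\n\r\n\n"
            + "This is, the next, section,\nsfhklj,sfhjk,4626246\n4yw2,fdhds5juj,53ujj\n";

        string[] sections = SplitSections(input);
        Console.WriteLine(sections.Length); // 2

        foreach (string section in sections)
        {
            // After normalization, subsections are just the lines of a section.
            string[] subsections = section.Split('\n');
            Console.WriteLine(subsections.Length); // 3
        }
    }
}
```

Because everything is normalized to \n first, the subsections inside each section can then be split on a plain '\n' without another regex.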