Matching double line breaks using Regex

Tags: regex, edifact

I am writing a regex to extract the various pieces of information from an EDIFACT UN code list. There are tens of thousands of codes, so rather than typing them in by hand I want to parse the text file and pull out the bits I need. The file is structured in a way that makes those bits easy to identify.

I built the regex below and am testing it in Regex Hero, but I cannot get the codeComment group to match everything up to a double line break. I tried the character class [^\n\n], but that still does not capture everything up to the double line break.

Note: I have selected the Multiline option on Regex Hero.

(?<element>\d+)\s\s(?<elementName>.*)\[[BCI]\]\s+Desc: (?<desc>[^\n]*\s*[^\n]*)
^\s*Repr: (?<type>a(?:n)?)..(?<length>\d+)
^\s*(?<code>\d+)\s*(?<codeName>[^\n]*)
^\s{14}(?<codeComment>[^\n]*)

This is the example text I am using to match.

----------------------------------------------------------------------

  • 1073 Document line action code [B]

    Desc: Code indicating an action associated with a line of a
        document.

    Repr: an..3

    1 Included in document/transaction
        The document line is included in the
        document/transaction.
        should capture this as well.

    2 Excluded from document/transaction
        The document line is excluded from the
        document/transaction.

What I want is for codeComment to contain the following:

The document line is included in the
          document/transaction.
          should capture this as well.

but it is only extracting the first line:

The document line is included in the
Intrepid asked Mar 31 '26 18:03


1 Answer

In a character class, every character counts once, no matter how often you write it, so [^\n\n] means exactly the same as [^\n]. A character class also matches only a single character, so it cannot be used to check for consecutive line breaks. But you can use a lookahead assertion:
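To see the character class point in isolation, here is a small sketch in Python rather than .NET (an assumption on my part; the two engines treat a repeated character in a class the same way). The sample string is made up for illustration.

```python
import re

# Hypothetical sample: two lines, a blank line, then another paragraph.
text = "first line\nsecond line\n\nnext paragraph"

# [^\n\n] lists the same character twice, so it is the same set as [^\n].
single = re.match(r"[^\n]*", text).group(0)
double = re.match(r"[^\n\n]*", text).group(0)

# Both patterns stop at the very first line break.
print(single == double)
```

Either pattern consumes only "first line", which is why writing the newline twice does not help here.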

^\s{14}(?<codeComment>(?s)(?:(?!\n\n).)*)

(?s) switches on singleline mode (to allow the dot to match newlines).

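(?!\n\n) asserts that the current position is not followed by two consecutive line breaks. Because the check runs before every character the dot consumes, the group stops just before the first blank line.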
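For anyone who wants to try the lookahead outside Regex Hero, here is a rough Python sketch of the same idea. It is not the .NET pattern verbatim: Python spells named groups (?P<name>...), and recent Python versions reject an inline (?s) in the middle of a pattern, so re.DOTALL is passed as a flag instead. The indentation handling ([ \t]+ rather than \s{14}) and the whitespace clean-up at the end are my own assumptions, chosen to fit the sample as it is shown in the question.

```python
import re

# Sample adapted from the question; the indentation widths are an assumption.
sample = (
    "    1 Included in document/transaction\n"
    "        The document line is included in the\n"
    "        document/transaction.\n"
    "        should capture this as well.\n"
    "\n"
    "    2 Excluded from document/transaction\n"
    "        The document line is excluded from the\n"
    "        document/transaction.\n"
)

pattern = re.compile(
    # Code number and name on one line.
    r"^[ \t]*(?P<code>\d+)[ \t]+(?P<codeName>[^\n]*)\n"
    # Comment: consume any character as long as a blank line does not follow.
    r"[ \t]+(?P<codeComment>(?:(?!\n\n).)*)",
    re.MULTILINE | re.DOTALL,
)

codes = {}
for m in pattern.finditer(sample):
    # Collapse the line breaks and indentation inside the captured comment.
    codes[m.group("code")] = " ".join(m.group("codeComment").split())

print(codes["1"])
print(codes["2"])
```

The raw codeComment group still contains the line breaks and leading spaces of the continuation lines, so some post-processing like the join/split above is probably needed whichever engine you use.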

Tim Pietzcker answered Apr 02 '26 13:04