
How can I match multi-line patterns in the command line with perl-style regex?

I regularly use regex to transform text.

To transform giant text files from the command line, Perl lets me do this:

perl -pe 's/.../.../g' < in.txt > out.txt

But this inherently works line by line. Occasionally, I want to match patterns that span multiple lines.

How can I do this on the command line?

JnBrymn asked Feb 12 '23 18:02


2 Answers

To slurp a file instead of processing it line by line, use the -0777 switch:

perl -0777 -pe 's/.../.../g' in.txt > out.txt

As documented in perlrun #Command Switches:

The special value -00 will cause Perl to slurp files in paragraph mode. Any value -0400 or above will cause Perl to slurp files whole, but by convention the value -0777 is the one normally used for this purpose.

Obviously, this may not work well for large files, because the whole file is read into memory. In that case you'll need to code some kind of buffering to do the replacement. We can't advise any better without more information about your intent.

Miller answered Feb 15 '23 07:02


Grepping across line boundaries

So you want to grep across line boundaries...

You quite possibly already have pcregrep installed. As you may know, PCRE stands for Perl-Compatible Regular Expressions, and the library is definitely Perl-style, though not identical to Perl.

To match across multiple lines, you have to turn on pcregrep's multi-line option -M, which is not the same as the inline modifier (?m).

Running pcregrep -M "(?s)^b.*\d+" text.txt

On this text file:

a
b
c11

The output will be

b
c11

whereas grep would return empty.

Excerpt from the doc:

-M, --multiline Allow patterns to match more than one line. When this option is given, patterns may usefully contain literal newline characters and internal occurrences of ^ and $ characters. The output for a successful match may consist of more than one line, the last of which is the one in which the match ended. If the matched string ends with a newline sequence the output ends at the end of that line.

When this option is set, the PCRE library is called in "multiline" mode. There is a limit to the number of lines that can be matched, imposed by the way that pcregrep buffers the input file as it scans it. However, pcregrep ensures that at least 8K characters or the rest of the document (whichever is the shorter) are available for forward matching, and similarly the previous 8K characters (or all the previous characters, if fewer than 8K) are guaranteed to be available for lookbehind assertions. This option does not work when input is read line by line (see --line-buffered.)

zx81 answered Feb 15 '23 08:02