
Match only the first paragraph using bash

We have

...a file containing paragraphs separated by two newlines (\r\n\r\n or \n\n). The paragraphs themselves may contain single newlines (\r\n or \n). The goal is to use a Bash one-liner to match only the first paragraph and print it to stdout.

E.g.:

$ cat foo.txt
Foo
* Bar

Baz
* Foobar

Even more stuff to match here.

results in:

$ cat foo.txt | <some-command>
Foo
* Bar
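To reproduce the problem, the sample input can be generated in both line-ending variants. This is only a sketch: the file name foo-crlf.txt is made up for illustration, and the content is the example shown above.

```shell
# Create the sample file with LF line endings.
printf 'Foo\n* Bar\n\nBaz\n* Foobar\n\nEven more stuff to match here.\n' > foo.txt

# Create the same content with CRLF line endings (hypothetical file name).
printf 'Foo\r\n* Bar\r\n\r\nBaz\r\n* Foobar\r\n\r\nEven more stuff to match here.\r\n' > foo-crlf.txt

# Make the line endings visible: od -c prints \r and \n as escapes.
od -c foo-crlf.txt | head -n 3
```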

I've already tried

...this regex (?s)(.+?)(\r?\n){2}|.+?$ with grep using

  • Git Bash on Windows (GNU grep 3.1),
  • Bash on Lubuntu 20.04.1 LTS (GNU grep 3.4) and
  • iTerm+Fish on Mac (BSD grep 2.5.1-FreeBSD).

The first two environments produced:

$ grep -Poz '(?s)(.+?)(\r?\n){2}|.+?$' foo.txt
Foo
* Bar

Baz
* Foobar

The approach on Mac failed, due to differences between BSD grep and GNU grep.

But

... on regex101.com this regex works on foo.txt: https://regex101.com/r/uoej8O/1. Could this be because the global flag is disabled there?
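The difference can be seen with a much smaller input. With the global flag off, regex101 stops after the first match, whereas grep -o prints every non-overlapping match. A minimal sketch, using a made-up one-line input rather than foo.txt:

```shell
# grep -o prints every non-overlapping match, one per line,
# which corresponds to regex101 with the global flag enabled.
printf 'a1b2c3\n' | grep -o '[0-9]'

# Keeping only the first match needs an extra step, for example head.
printf 'a1b2c3\n' | grep -o '[0-9]' | head -n 1
```

Note that with -z GNU grep terminates its output records with a NUL byte instead of a newline, so head -n 1 alone would not isolate the first multi-line match there.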

trilloyd asked Nov 18 '20 10:11


3 Answers

This is a tailor-made problem for GNU awk with a custom record separator. We can set RS to a regex that splits the input on two or more consecutive line breaks, each being an optional \r followed by \n:

awk -v RS='(\r?\n){2,}' 'NR == 1' file

This outputs:

Foo
* Bar

To make awk stop reading after the first record, which is more efficient when the input is very big:

awk -v RS='(\r?\n){2,}' '{print; exit}' file
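Where GNU awk is not available (for example the BSD tools on a Mac), a possible alternative is POSIX awk's paragraph mode, where an empty RS splits records on blank lines. This is a different technique from the regex RS above, and the sketch assumes it is acceptable to strip every \r from the output. The helper name first_paragraph and the sample file name are made up for illustration.

```shell
# first_paragraph is a hypothetical helper name, not part of the answer above.
first_paragraph() {
    # Drop carriage returns so CRLF files have truly empty separator lines,
    # then let awk's paragraph mode (empty RS) split on blank lines.
    tr -d '\r' < "$1" | awk -v RS= '{ print; exit }'
}

# Sample input with CRLF line endings (hypothetical file name).
printf 'Foo\r\n* Bar\r\n\r\nBaz\r\n* Foobar\r\n' > sample-crlf.txt

# Prints the first paragraph with LF line endings.
first_paragraph sample-crlf.txt
```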
anubhava answered Nov 16 '22 11:11


With GNU awk, if the paragraphs are separated by \r\n\r\n or \n\n:

$ awk -v RS="\r?\n\r?\n" '{print $0;exit}' file

Output:

Foo
* Bar
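With CRLF input, only the separator is consumed, so the \r inside the paragraph stays in the output. A quick way to check this, assuming an awk that treats a multi-character RS as a regex (GNU awk does) and a made-up file name crlf-sample.txt:

```shell
# Sample input with CRLF line endings (hypothetical file name).
printf 'Foo\r\n* Bar\r\n\r\nBaz\r\n* Foobar\r\n' > crlf-sample.txt

# Print the first record and make control characters visible with od -c.
awk -v RS="\r?\n\r?\n" '{print $0; exit}' crlf-sample.txt | od -c
```

If the carriage return is unwanted, piping the result through tr -d '\r' removes it.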
James Brown answered Nov 16 '22 10:11


You can use GNU grep like this:

grep -Poz '(?s)^.+?(?=\R{2}|$)' file

See the PCRE regex demo.

Details

  • (?s) - a DOTALL inline modifier that makes . match all chars including linebreak chars
  • ^ - start of the whole string
  • .+? - any 1 or more chars, as few as possible
  • (?=\R{2}|$) - a positive lookahead that matches a location immediately followed by a double line break sequence (\R{2}) or the end of the string ($).
Wiktor Stribiżew answered Nov 16 '22 12:11