
Split multi-paragraph documents into paragraph-numbered sentences

I have a list of well-parsed, multi-paragraph documents (paragraphs are separated by \n\n and sentences by ".") that I'd like to split into sentences, each tagged with the number of the paragraph it belongs to within the document. For example, given this two-paragraph input:

First sentence of the 1st paragraph. Second sentence of the 1st paragraph. \n\n 

First sentence of the 2nd paragraph. Second sentence of the 2nd paragraph. \n\n

Ideally the output should be:

1 First sentence of the 1st paragraph. 

1 Second sentence of the 1st paragraph. 

2 First sentence of the 2nd paragraph.

2 Second sentence of the 2nd paragraph.

I'm familiar with the Lingua::Sentences package in Perl, which can split documents into sentences. However, it does not support paragraph numbering. Is there an alternative way to achieve the above? (The documents contain no abbreviations.) Any help is greatly appreciated. Thanks!

user735276 asked Mar 22 '23 19:03



1 Answer

If you can rely on the period (.) being the only sentence delimiter, you can do this:

perl -00 -nlwe 'print qq($. $_) for split /(?<=\.)\s*/' yourfile.txt

Explanation:

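  • -00 sets the input record separator to the empty string, which enables paragraph mode: each record read is one whole paragraph.
  • -n wraps the code in a loop that reads the file one record at a time into $_.
  • -l removes the record separator from each record read and sets the output record separator to the input record separator, which in paragraph mode translates to two newlines.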

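Then we simply split after each period using a lookbehind assertion, which keeps the period attached to its sentence, and print each sentence preceded by $., the current record number. Because records are paragraphs in this mode, $. is the paragraph number rather than a line number.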
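If a script is easier to maintain than a one-liner, the same idea can be sketched as a small function. This is only an illustration under the question's assumptions (paragraphs separated by blank lines, every sentence ending in a period, no abbreviations). The helper name number_sentences and the hard-coded sample string are made up for this example and are not part of the original answer.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the answer): takes a whole document as one
# string and returns a list of "<paragraph number> <sentence>" strings.
sub number_sentences {
    my ($document) = @_;
    my @numbered;
    my $paragraph_no = 0;

    # Paragraphs are separated by two or more consecutive newlines.
    for my $paragraph (split /\n{2,}/, $document) {
        next unless $paragraph =~ /\S/;    # skip whitespace-only chunks
        $paragraph_no++;

        # Split after each period (the lookbehind keeps the period with
        # its sentence) and swallow any whitespace that follows it.
        for my $sentence (split /(?<=\.)\s*/, $paragraph) {
            $sentence =~ s/^\s+//;         # trim leading whitespace
            next unless length $sentence;
            push @numbered, "$paragraph_no $sentence";
        }
    }
    return @numbered;
}

# Sample input modelled on the question's two-paragraph example.
my $sample =
    "First sentence of the 1st paragraph. Second sentence of the 1st paragraph. \n\n"
  . "First sentence of the 2nd paragraph. Second sentence of the 2nd paragraph. \n\n";

print "$_\n" for number_sentences($sample);
```

Run as is, this prints the four numbered sentences from the question, one per line. To process a real file, the sample string could be replaced by the file's contents read in one go, and the print changed to emit a blank line between sentences if that layout is wanted.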

TLP answered Apr 06 '23 03:04
