
Paragraph Segmentation using Machine Learning

I have a large repository of documents in PDF format. The documents come from different sources, and have no one single style. I use Tika to extract the text from the documents, and now I'd like to segment the text into paragraphs.

I can't use regexes, because the documents have no single style:

  • The number of newlines between paragraphs varies between 2 and 4.
  • In some documents the lines within a single paragraph are separated by 2 newlines, in others by a single newline.

So I turn to machine learning. The (great) Python NLTK book has an excellent example of using classification to segment sentences, with features like the characters before and after a '.' fed to a naive Bayes classifier, but it does not cover paragraph segmentation.
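To make the idea concrete, here is a minimal sketch of how that classification approach might carry over to paragraph boundaries. Everything in it is an assumption for illustration: the three features, the helper names (`break_features`, `TinyNaiveBayes`, `segment`), the small hand-written naive Bayes (standing in for a library classifier so the example needs only the standard library), and the toy training text. Each run of newlines is a candidate boundary, and its length is measured relative to the shortest run in the same document, so that documents with different spacing styles become comparable.

```python
import math
import re
from collections import Counter, defaultdict

SENTENCE_END = (".", "!", "?", ":")

def break_features(text):
    """Return one feature dict per run of newlines in `text` (hypothetical feature set)."""
    runs = list(re.finditer(r"\n+", text))
    if not runs:
        return []
    shortest = min(len(m.group()) for m in runs)
    features = []
    for m in runs:
        before = text[max(0, m.start() - 20):m.start()].rstrip()
        after = text[m.end():m.end() + 20].lstrip()
        features.append({
            # newlines beyond the document's own shortest run, capped at 2
            "extra_newlines": min(len(m.group()) - shortest, 2),
            "prev_ends_sentence": before.endswith(SENTENCE_END),
            "next_starts_upper": after[:1].isupper(),
        })
    return features

class TinyNaiveBayes:
    """Naive Bayes over categorical features with add-one smoothing."""
    def __init__(self):
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)  # label -> (name, value) counts
        self.values = defaultdict(set)              # name -> values seen in training

    def train(self, examples):
        for features, label in examples:
            self.label_counts[label] += 1
            for name, value in features.items():
                self.feature_counts[label][(name, value)] += 1
                self.values[name].add(value)

    def classify(self, features):
        total = sum(self.label_counts.values())
        def score(label):
            n = self.label_counts[label]
            s = math.log(n / total)
            for name, value in features.items():
                seen = self.feature_counts[label][(name, value)]
                s += math.log((seen + 1) / (n + len(self.values[name]) + 1))
            return s
        return max(self.label_counts, key=score)

def segment(text, classifier):
    """Split `text` at newline runs the classifier labels as paragraph breaks."""
    paragraphs, current, pos = [], [], 0
    for m, feats in zip(re.finditer(r"\n+", text), break_features(text)):
        current.append(text[pos:m.start()].strip())
        pos = m.end()
        if classifier.classify(feats):
            paragraphs.append(" ".join(p for p in current if p))
            current = []
    current.append(text[pos:].strip())
    paragraphs.append(" ".join(p for p in current if p))
    return [p for p in paragraphs if p]

# Toy, hand-labelled training document: True marks a paragraph break.
train_text = "First line\nsecond line.\n\nNew para starts\nand ends.\n\n\nThird."
train_labels = [False, True, False, True]
clf = TinyNaiveBayes()
clf.train(zip(break_features(train_text), train_labels))

# A document in a different style: 2 newlines inside a paragraph, 4 between.
other_text = "a wrapped\n\nline here.\n\n\n\nNext one."
result = segment(other_text, clf)
```

A real system would need far more labelled boundaries than this toy example, and probably richer features (line length, indentation, list markers), which is exactly where tagged training data becomes the bottleneck.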

So my questions are:

  • Is there another way to do paragraph segmentation?
  • If I go with machine learning, is there tagged data of segmented paragraphs I can use for training?
Gino asked Jan 23 '17 08:01



1 Answer

There is surprisingly little research on automatic detection of paragraph boundaries. I have found the following papers, all of which are quite old:

Sporleder and Lapata (2004): Automatic Paragraph Identification: A Study across Languages and Domains

Sporleder and Lapata (2005): Broad coverage paragraph segmentation across languages and domains

Filippova and Strube (2006): Using Linguistically Motivated Features for Paragraph Boundary Identification

Genzel (2005): A Paragraph Boundary Detection System

martin_wun answered Sep 29 '22 13:09
