Paragraph Segmentation using Machine Learning

Question

I have a large repository of documents in PDF format. The documents come from different sources, and have no one single style. I use Tika to extract the text from the documents, and now I'd like to segment the text into paragraphs.

I can't use regexes, because the documents have no single style:

The number of \n characters between paragraphs varies between 2 and 4.
In some documents the lines within a single paragraph are separated by two \n characters, in others by a single \n.

So I turned to machine learning. The (great) Python NLTK book has an excellent example of classification for sentence segmentation, using features such as the characters before and after a '.' with a naive Bayes classifier, but it has nothing on paragraph segmentation.

So my questions are:

Is there another way for paragraph segmentation?
If I go with machine learning, is there tagged data of segmented paragraphs I can use for training?

martin_wun · Accepted Answer

There is surprisingly little research on the automatic detection of paragraph boundaries. I have found the following papers, all of which are quite old:

Sporleder and Lapata (2004): Automatic Paragraph Identification: A Study across Languages and Domains

Sporleder and Lapata (2005): Broad coverage paragraph segmentation across languages and domains

Filippova and Strube (2006): Using Linguistically Motivated Features for Paragraph Boundary Identification

Genzel (2005): A Paragraph Boundary Detection System
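
If no tagged corpus turns up, one option is to hand-label a few documents and adapt the sentence-segmentation idea from the question to line breaks. The following is only a rough sketch under several assumptions: the feature names, the 0.7 "short line" threshold, and the tiny hand-labeled training texts are all made up for illustration, and the small naive Bayes class is a dependency-free stand-in rather than NLTK's classifier. Newline counts are measured relative to the smallest run in each document, so a "1 within, 2 between" layout and a "2 within, 4 between" layout produce the same feature values.

```python
import math
import re
import statistics
from collections import Counter, defaultdict


def gap_features(text):
    """Split text into lines and describe every break between two lines."""
    text = re.sub(r"[ \t]*\r?\n", "\n", text).strip()
    parts = re.split(r"(\n+)", text)          # line, newline run, line, ...
    lines, runs = parts[0::2], parts[1::2]
    typical = statistics.median(len(line) for line in lines)
    min_run = min(map(len, runs), default=1)  # smallest gap in this document
    feats = []
    for prev, run, nxt in zip(lines, runs, lines[1:]):
        feats.append({
            "extra_newlines": min(len(run) - min_run, 3),
            "prev_ends_sentence": prev.endswith((".", "!", "?", ":")),
            "next_starts_upper": nxt.lstrip()[:1].isupper(),
            "prev_is_short": len(prev) < 0.7 * typical,  # assumed threshold
        })
    return lines, feats


class NaiveBayes:
    """Minimal naive Bayes over discrete features, with add-one smoothing."""

    def __init__(self):
        self.label_counts = Counter()
        self.value_counts = defaultdict(Counter)  # (label, feature) -> values

    def train(self, examples):
        for feats, label in examples:
            self.label_counts[label] += 1
            for name, value in feats.items():
                self.value_counts[label, name][value] += 1

    def classify(self, feats):
        total = sum(self.label_counts.values())

        def score(label):
            n = self.label_counts[label]
            s = math.log(n / total)
            for name, value in feats.items():
                counts = self.value_counts[label, name]
                # One extra slot in the denominator for unseen values.
                s += math.log((counts[value] + 1) / (n + len(counts) + 1))
            return s

        return max(self.label_counts, key=score)


def segment(text, classifier):
    """Join lines into paragraphs, splitting where the classifier says so."""
    lines, feats = gap_features(text)
    paragraphs, current = [], [lines[0]]
    for line, f in zip(lines[1:], feats):
        if classifier.classify(f):            # True means paragraph boundary
            paragraphs.append(" ".join(current))
            current = []
        current.append(line)
    paragraphs.append(" ".join(current))
    return paragraphs


# Hypothetical hand-labeled data: one True/False label per line break.
TRAINING = [
    ("The committee met on Monday to review the budget for the\n"
     "coming year and agreed on most items.\n\n"
     "A second meeting is planned for next month, when the remaining\n"
     "items will be discussed.", [False, True, False]),
    ("Results were collected over a period of six weeks from all\n\n"
     "participating sites in the study.\n\n\n\n"
     "Each site reported its figures separately, and the totals were\n\n"
     "checked by hand.", [False, True, False]),
]

clf = NaiveBayes()
for doc, labels in TRAINING:
    clf.train(zip(gap_features(doc)[1], labels))
```

With real data you would label far more than two documents and evaluate on held-out ones. The feature dictionaries are plain (features, label) pairs, so they should also be usable with NLTK's NaiveBayesClassifier.train if you prefer to stay close to the book's example.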

Tags:

python

machine-learning

text-segmentation

nlp

apache-tika

Gino
