I have a large repository of documents in PDF format. The documents come from different sources and share no single style. I use Tika to extract the text from the documents, and now I'd like to segment the text into paragraphs.
I can't use regexes, because the documents have no single style:
The number of \nl between paragraphs varies between 2 and 4.
The line breaks themselves are inconsistent as well, some with single \nl.
So I turn to machine learning. In the (great) Python NLTK book there's an excellent use of classification for segmenting sentences, using features like the characters before and after a '.' with a naive Bayes classifier, but nothing on paragraph segmentation.
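To make the idea concrete, here is a minimal sketch of how the NLTK book's punctuation features might carry over to paragraphs: every run of newlines is treated as a candidate boundary and described by a small feature dict. The function names, the feature set, and the 0.7 width threshold are all assumptions made up for illustration, not something taken from the book or from Tika.

```python
import re

# A run of newlines, with any trailing spaces/tabs on otherwise blank lines.
# Leading indentation of the following line is deliberately left in place.
NEWLINE_RUN = re.compile(r"((?:[ \t]*\n)+)")


def boundary_features(prev_line, newline_count, next_line, max_width):
    """Describe one candidate boundary (hypothetical feature set)."""
    prev = prev_line.rstrip()
    nxt = next_line.lstrip()
    return {
        # Cap the count so rare long runs share one feature value.
        "newline_count": min(newline_count, 4),
        "prev_ends_sentence": prev.endswith((".", "!", "?", ":")),
        "next_starts_upper": nxt[:1].isupper(),
        "next_indented": next_line[:1] in (" ", "\t"),
        # A line much shorter than the widest one often closes a paragraph.
        # The 0.7 factor is an arbitrary assumption.
        "prev_is_short": len(prev) < 0.7 * max_width,
    }


def candidate_featuresets(text):
    """Return one feature dict per run of newlines in the text."""
    # With a capturing group, re.split alternates: line, run, line, run, ...
    parts = NEWLINE_RUN.split(text)
    max_width = max((len(line.rstrip()) for line in parts[0::2]), default=0)
    return [
        boundary_features(parts[i - 1], parts[i].count("\n"), parts[i + 1], max_width)
        for i in range(1, len(parts) - 1, 2)
    ]
```

Each dict is in the shape NLTK classifiers expect, so given hand-labeled boundaries (True for a paragraph break, False otherwise) the pairs could be passed to nltk.NaiveBayesClassifier.train as a list of (features, label) tuples, assuming NLTK is installed. The labels are the expensive part, since they have to come from manually annotated documents.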
So my questions are:
There is surprisingly little research on this topic of automatic detection of paragraph boundaries. I have found the following, all of which are quite old:
Sporleder and Lapata (2004): Automatic Paragraph Identification: A Study across Languages and Domains
Sporleder and Lapata (2005): Broad coverage paragraph segmentation across languages and domains
Filippova and Strube (2006): Using Linguistically Motivated Features for Paragraph Boundary Identification
Genzel (2005): A Paragraph Boundary Detection System