
Splitting PDF files into Paragraphs

I have a question about splitting PDF files. I have a collection of PDF files that I want to split by paragraph, so that each paragraph of a PDF becomes a file of its own. I would appreciate any help with this, preferably in Python, but if that is not possible, any language will do.

LoniF asked Feb 07 '17 15:02



1 Answer

You can use pdftotext for this, wrapped in a Python subprocess call. Alternatively, you could use a library that already does this for you, such as textract. Here is a quick example. Note: I have used a run of four or more whitespace characters as the delimiter to convert the text into a list of paragraphs; you might want to use a different technique.

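import re
import textract

# Read the content of the PDF as text; textract.process returns bytes, so decode it.
text = textract.process('file_name.pdf').decode('utf-8')

# Use a run of four or more whitespace characters as the paragraph delimiter
# to convert the text into a list of paragraphs.
paragraphs = re.split(r'\s{4,}', text)
print(paragraphs)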
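If you go the pdftotext route instead and want each paragraph in its own file, something along these lines might work. This is only a rough sketch under a few assumptions: the pdftotext command-line tool (from Poppler) is installed and on your PATH, paragraphs are separated by blank lines in the extracted text (this depends heavily on the PDF layout), and the helper names pdf_to_text, split_paragraphs and write_paragraphs are made up for illustration.

```python
import re
import subprocess
from pathlib import Path


def pdf_to_text(pdf_path):
    """Extract text with the pdftotext CLI; the trailing '-' sends output to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def split_paragraphs(text):
    """Split on blank lines (an assumed paragraph boundary) and drop empty chunks."""
    chunks = re.split(r"\n\s*\n", text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


def write_paragraphs(paragraphs, out_dir, stem):
    """Write each paragraph to its own numbered .txt file and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, paragraph in enumerate(paragraphs, start=1):
        path = out_dir / f"{stem}_{index:04d}.txt"
        path.write_text(paragraph + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

For a single file you could then call write_paragraphs(split_paragraphs(pdf_to_text('file_name.pdf')), 'paragraphs', 'file_name'), and loop over your collection for the rest. The output here is plain text files rather than PDFs; if the blank-line rule does not fit your documents, split_paragraphs is the only part you would need to change.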
Radan answered Oct 11 '22 14:10