I have a question about splitting PDF files. I have a collection of PDF files that I want to split by paragraph, so that each paragraph of a PDF becomes a file of its own. I would appreciate any help with this, preferably in Python, but if that is not possible, any language will do.
How can I split one PDF into multiple documents? The Acrobat Split PDF online tool lets you quickly split and separate PDF pages into up to 20 new PDF files without the need to download software or pay for user permissions. First, select a PDF of 500 pages or less, and sign in to Acrobat to upload files.
After saving the file, the "Split" menu appears on the screen and offers two options: "Split by Number of Pages" or "Split by Top Level Bookmarks." You can then choose where the output is saved by clicking "Save." Finally, click "OK" to save each page of the Word document as a separate PDF.
You can use pdftotext for this and wrap it in Python's subprocess module. Alternatively, you could use a library that already does this implicitly, such as textract. Here is a quick example. Note: I have used a run of four or more whitespace characters as the delimiter to split the text into a list of paragraphs; you might want to use a different technique.
import re
import textract

# Read the content of the PDF as text; textract.process returns bytes, so decode it.
text = textract.process('file_name.pdf').decode('utf-8')
# Treat any run of four or more whitespace characters as a paragraph delimiter.
paragraphs = re.split(r'\s{4,}', text)
print(paragraphs)
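The pdftotext route mentioned above, plus the step of writing each paragraph to its own file, could be sketched roughly as follows. This is only a minimal sketch under some assumptions: the pdftotext command-line tool (from Poppler) is installed and on the PATH, paragraphs are separated by blank lines in the extracted text (which will not hold for every PDF), and the output is one plain-text file per paragraph rather than one PDF per paragraph. The function names pdf_to_text, split_paragraphs and write_paragraphs are made up for illustration.

```python
import re
import subprocess
from pathlib import Path


def pdf_to_text(pdf_path):
    """Extract text with the pdftotext CLI; '-' as the output name sends it to stdout."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def split_paragraphs(text):
    """Split on blank lines (an assumed paragraph delimiter) and drop empty pieces."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]


def write_paragraphs(paragraphs, out_dir, stem):
    """Write each paragraph to <out_dir>/<stem>_<NNNN>.txt and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, para in enumerate(paragraphs, start=1):
        path = out_dir / f"{stem}_{i:04d}.txt"
        path.write_text(para + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

For a folder of PDFs, you could loop over Path('.').glob('*.pdf') and call write_paragraphs(split_paragraphs(pdf_to_text(pdf)), 'paragraphs', pdf.stem) for each one. If blank lines do not mark paragraphs in your files, swap the pattern in split_paragraphs for something that fits, such as the four-whitespace rule used in the textract example.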