 

How to extract formatted text content from a PDF

How can I extract the text content (not images) from a PDF while (roughly) maintaining the style and layout like Google Docs can?

hoju asked Feb 04 '10 00:02





1 Answer

To extract the text from a PDF and get its position, you can use PDFMiner. PDFMiner can also export the PDF directly to HTML, keeping the text in the right position.
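As a rough sketch of both approaches, assuming the maintained pdfminer.six fork is installed (`pip install pdfminer.six`): `extract_pages` yields layout objects whose `bbox` gives the position of each text block, and `extract_text_to_fp` with `output_type="html"` writes positioned HTML. The helper names `text_blocks`, `format_block`, and `pdf_to_html` are made up for this example.

```python
from io import StringIO


def text_blocks(pdf_path):
    """Yield (page_number, bbox, text) for each text block in the PDF."""
    # Imported lazily so this module still loads where pdfminer.six is missing.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                # bbox is (x0, y0, x1, y1) in points; the origin is the
                # bottom-left corner of the page.
                yield page_number, element.bbox, element.get_text()


def format_block(page_number, bbox, text):
    """Render one block as 'p<page> (<x0>, <y0>) <text>' for quick inspection."""
    x0, y0, _x1, _y1 = bbox
    return f"p{page_number} ({x0:.0f}, {y0:.0f}) {text.strip()}"


def pdf_to_html(pdf_path):
    """Return the PDF rendered as HTML with positioned text."""
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    out = StringIO()
    with open(pdf_path, "rb") as pdf_file:
        # codec=None keeps the output as str so it can go into a StringIO.
        extract_text_to_fp(pdf_file, out, laparams=LAParams(),
                           output_type="html", codec=None)
    return out.getvalue()
```

Typical use would be `for block in text_blocks("input.pdf"): print(format_block(*block))`, where `input.pdf` is a placeholder path. pdfminer.six also ships a `pdf2txt.py` command-line tool with a `-t html` option if you would rather not write code.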

I don't know your use case, but you can run into a lot of problems doing this, because PDF is presentation-oriented rather than content-oriented: the text flow is not continuous. So if you want the text to be editable, it will not be an easy task.
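To illustrate why the flow has to be reconstructed, here is a minimal sketch that regroups positioned fragments into lines. It is a simple heuristic written for this answer, not PDFMiner's own layout algorithm; the function name `group_into_lines`, the `tolerance` value, and the sample fragments are all assumptions.

```python
def group_into_lines(fragments, tolerance=2.0):
    """Group (x, y, text) fragments into lines, top to bottom, left to right.

    PDF coordinates put y=0 at the bottom of the page, so a larger y is
    higher up. Fragments whose y is within `tolerance` points of the first
    fragment of a line are treated as part of that line.
    """
    lines = []  # each entry: [reference_y, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within each line, order the fragments by their x position.
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]


# Fragments in the arbitrary order a PDF content stream might store them.
fragments = [
    (300.0, 700.0, "world"),
    (72.0, 650.0, "Second line"),
    (72.0, 700.5, "Hello"),
]
print(group_into_lines(fragments))  # ['Hello world', 'Second line']
```

Real documents add multi-column layouts, rotated text, and tables, which is where a heuristic this simple breaks down.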

Etienne answered Oct 26 '22 02:10
