I have a bunch of PDF files that I need to convert to TXT. Unfortunately, when i use one of the many available utilities to do this, it loses all formatting and all the tabulated data in the PDF gets jumbled up. Is it possible to use Python to extract the text from the PDF by specifying postions, etc?
Thanks.
You can capture text from a scanned image, upload your image file from your computer, or take a screenshot on your desktop. Then simply right click on the image, and select Grab Text. The text from your scanned PDF can then be copied and pasted into other programs and applications.
Copy selected text Choose Edit > Copy to copy the selected text to another application. Right-click on the selected text, and then select Copy. Right-click on the selected text, and then choose Copy With Formatting.
Keyboard Commands: Select your text while in the editor. Hold down the CTRL key and press X to cut. Hold down the CTRL key and press C to copy.
PDFs do not contain tabular data unless it contains structured content. Some tools include heuristics to try and guess the data structure and put it back. I wrote a blog article explaining the issues with PDF text extraction at http://www.jpedal.org/PDFblog/2009/04/pdf-text/
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With