How to extract text from pdf in Python 3.7

Tags: python, pdf, python-3.7, pypdf2, pdf-extraction

I am trying to extract text from a PDF file using Python. My main goal is to create a program that reads a bank statement and extracts its text to update an Excel file, so I can easily record monthly spending. Right now I am focusing just on extracting the text from the PDF file, but I don't know how to do so.

What is currently the best and easiest way to extract text from a PDF file into a string? What library is best to use today and how can I do it?

I have tried using PyPDF2, but every time I try to extract text from any page using extractText(), it returns an empty string. I have tried installing textract, but I get errors, I think because it needs more libraries.

import PyPDF2

pdfFileObj = open("January2019.pdf", 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

pageObj = pdfReader.getPage(0)
print(pageObj.extractText())

This prints an empty string when it should print the contents of the page.

asked Apr 19 '19 20:04

RaV1oLLi

2 Answers

I tried many methods without success, including PyPDF2 and Tika. I finally found the pdfplumber module, which works for me; you can try it too.

Hope this will be helpful to you.

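import pdfplumber

# Open the PDF; the with block closes the file when it ends.
with pdfplumber.open('pdffile.pdf') as pdf:
    page = pdf.pages[0]         # first page of the document
    text = page.extract_text()  # text of that page as a string
print(text)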
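
The snippet above reads only the first page. As a possible extension, here is a minimal sketch that collects the text of every page, assuming pdfplumber is installed. The helper names `join_page_texts` and `extract_all_text` are hypothetical and not part of pdfplumber. Depending on the pdfplumber version, `extract_text()` may return `None` or an empty string for a page with no text layer (a scanned image, for example), so the helper treats both as empty.

```python
def join_page_texts(pages, separator="\n"):
    """Join the text of each page-like object, skipping pages with no text."""
    texts = []
    for page in pages:
        text = page.extract_text()
        # Treat None and whitespace-only results as "no text on this page".
        if text and text.strip():
            texts.append(text)
    return separator.join(texts)


def extract_all_text(path):
    """Return the text of every page in the PDF at `path` as one string."""
    # Third-party import kept local so join_page_texts has no dependency.
    import pdfplumber

    # The context manager closes the file even if extraction raises.
    with pdfplumber.open(path) as pdf:
        return join_page_texts(pdf.pages)
```

Calling `extract_all_text('pdffile.pdf')` should then give one string for the whole document. If it comes back empty, the PDF probably contains scanned images rather than text, and an OCR step would be needed instead.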

answered Sep 20 '22 18:09

Fly your ideas

Using tika worked for me!

from tika import parser

rawText = parser.from_file('January2019.pdf')

rawList = rawText['content'].splitlines()

This made it really easy to separate each line of the bank statement into a list.
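
With the statement split into lines, the remaining step toward the spreadsheet is picking out the transactions. Here is a rough sketch, assuming (hypothetically) that each transaction line looks like `01/15/2019 GROCERY STORE -45.67`. Real statements differ by bank, so the pattern `TRANSACTION_RE` and the helpers `parse_transactions` and `write_csv` are illustrative names only. It writes a CSV file, which Excel can open, using only the standard library.

```python
import csv
import re
from decimal import Decimal

# Assumed line layout: MM/DD/YYYY, a description, then a signed amount.
TRANSACTION_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+"
    r"(?P<description>.+?)\s+"
    r"(?P<amount>-?[\d,]+\.\d{2})$"
)


def parse_transactions(lines):
    """Return (date, description, amount) tuples for lines in the assumed layout."""
    rows = []
    for line in lines:
        match = TRANSACTION_RE.match(line.strip())
        if match is None:
            continue  # headers, blank lines, and anything else are skipped
        # Drop thousands separators before converting to an exact decimal.
        amount = Decimal(match.group("amount").replace(",", ""))
        rows.append((match.group("date"), match.group("description"), amount))
    return rows


def write_csv(rows, path):
    """Write the parsed rows to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["date", "description", "amount"])
        writer.writerows(rows)


# Made-up sample lines standing in for the list built from the PDF text.
sample_lines = [
    "Statement period: January 2019",
    "",
    "01/15/2019 GROCERY STORE -45.67",
    "01/20/2019 PAYROLL DEPOSIT 1,250.00",
]
transactions = parse_transactions(sample_lines)
monthly_total = sum(amount for _, _, amount in transactions)
```

In terms of the code above, `parse_transactions(rawList)` would take the place of the sample lines, and `write_csv(transactions, 'January2019.csv')` would produce a file Excel can open. `Decimal` is used rather than `float` so that currency amounts add up exactly.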

answered Sep 20 '22 18:09

RaV1oLLi
