
Extracting text from all pages of a PDF and writing the output to a file using Python

Tags:

python

I'm new to Python. I am using this code to extract text from a PDF. Is it possible to extract all pages and write the output to a file?

import PyPDF2
pdf_file = open('sample.pdf','rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(10)
page_content = page.extractText()
print (page_content)
Raquel Dourado asked Apr 10 '17 03:04


1 Answer

Use a loop to extract the text of each page and write it to a single output file.

import PyPDF2
with open('sample.pdf','rb') as pdf_file, open('sample.txt', 'w') as text_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    for page_number in range(number_of_pages):   # use xrange in Py2
        page = read_pdf.getPage(page_number)
        page_content = page.extractText()
        text_file.write(page_content)
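
PyPDF2 3.0 and later no longer accept the camelCase names used above (PdfFileReader, getNumPages, getPage, extractText). If you are on a newer release, the same loop can be sketched roughly as follows. The helper names write_pages and pdf_to_text and the blank-line separator between pages are my own choices for illustration, not part of the library.

```python
def write_pages(pages, text_file, separator="\n\n"):
    """Write the text of every page to text_file and return the page count.

    `pages` can be any iterable of objects with an extract_text() method,
    for example `reader.pages` from a newer PyPDF2 PdfReader.
    """
    count = 0
    for page in pages:
        # Treat a page with no extractable text as an empty string.
        text = page.extract_text() or ""
        text_file.write(text)
        # The separator is an assumption; drop it to match the loop above exactly.
        text_file.write(separator)
        count += 1
    return count


def pdf_to_text(pdf_path, txt_path):
    """Extract all pages of pdf_path into txt_path (hypothetical helper)."""
    # Imported here so write_pages can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    with open(pdf_path, "rb") as pdf_file, \
            open(txt_path, "w", encoding="utf-8") as text_file:
        reader = PdfReader(pdf_file)
        return write_pages(reader.pages, text_file)
```

Calling pdf_to_text('sample.pdf', 'sample.txt') should then produce the same kind of output file as the loop above, with a blank line after each page. Scanned PDFs that contain only images will still give empty text either way.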
kindall answered Nov 14 '22 08:11