How to extract text from a directory of PDF files efficiently with OCR?

Tags:

python

python-3.x

parallel-processing

tesseract

apache-tika

I have a large directory of image-only PDF files. How can I efficiently extract the text from all the files in the directory? So far I have tried:

import multiprocessing
import textract

def extract_txt(file_path):
    text = textract.process(file_path, method='tesseract')

p = multiprocessing.Pool(2)
file_path = ['/Users/user/Desktop/sample.pdf']
list(p.map(extract_txt, file_path))

However, this is not working: it takes a lot of time (some of my documents have 600 pages). Additionally: a) I do not know how to handle the directory transformation part efficiently. b) I would like to add a page separator, say <start/page = 1> ... page content ... <end/page = 1>, but I have no idea how to do this.

Thus, how can I apply the extract_txt function to every file in a directory that ends with .pdf, write the results to another directory as .txt files, and add a page separator to the OCR-extracted text?

Also, I was curious about using Google Docs for this task. Is it possible to use Google Docs programmatically to solve this text extraction problem?

UPDATE

Regarding the "adding a page separator" issue (<start/page = 1> ... page content ... <end/page = 1>): after reading Roland Smith's answer I tried:

import os

from PyPDF2 import PdfFileWriter, PdfFileReader
import textract


def extract_text(pdf_file):
    inputpdf = PdfFileReader(open(pdf_file, "rb"))
    for i in range(inputpdf.numPages):
        w = PdfFileWriter()
        w.addPage(inputpdf.getPage(i))
        outfname = 'page{:03d}.pdf'.format(i)
        with open(outfname, 'wb') as outfile:  # I presume you need `wb`.
             w.write(outfile)
        print('\n<begin page pos =' , i, '>\n')
        text = textract.process(str(outfname), method='tesseract')
        os.remove(outfname)  # clean up.
        print(str(text, 'utf8'))
        print('\n<end page pos =' , i, '>\n')

extract_text('/Users/user/Downloads/ImageOnly.pdf')

However, I still have an issue with the print() part: instead of printing, it would be more useful to save all the output to a file. Thus, I tried to redirect the output to a file:

sys.stdout=open("test.txt","w")
print('\n<begin page pos =' , i, '>\n')
sys.stdout.close()
text = textract.process(str(outfname), method='tesseract')
os.remove(outfname)  # clean up.
sys.stdout=open("test.txt","w")
print(str(text, 'utf8'))
sys.stdout.close()
sys.stdout=open("test.txt","w")
print('\n<end page pos =' , i, '>\n')
sys.stdout.close()

Any idea how to do the page extraction/separator trick and save everything to a file?

492

asked Apr 28 '17 05:04

john doe

1 Answer

In your code, you are extracting the text, but you don't do anything with it.

Try something like this:

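import textract

def extract_txt(file_path):
    # OCR the PDF; textract returns the extracted text as bytes.
    text = textract.process(file_path, method='tesseract')
    outfn = file_path[:-4] + '.txt'  # assuming filenames end with '.pdf'
    # Binary mode ('wb') because text is bytes.
    with open(outfn, 'wb') as output_file:
        output_file.write(text)
    return file_path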

This writes the text to a file that has the same name but a .txt extension.

It also returns the path of the original file to let the parent know that this file is done.

So I would change the mapping code to:

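import multiprocessing

if __name__ == '__main__':  # needed on platforms that spawn worker processes
    file_path = ['/Users/user/Desktop/sample.pdf']
    with multiprocessing.Pool() as p:
        for fn in p.imap_unordered(extract_txt, file_path):
            print('completed file:', fn)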

You don't need to give an argument when creating a Pool. By default it will create as many workers as there are CPU cores.
Using imap_unordered creates an iterator that starts yielding values as soon as they are available.
Because the worker function returned the filename, you can print it to let the user know that this file is done.
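
For the directory part of the question, one possible approach is to work out all input and output paths up front and then hand them to the pool. The sketch below covers only the path handling. `plan_outputs` and the directory names in the usage note are hypothetical names chosen for illustration; they are not part of textract or of the code above.

```python
from pathlib import Path


def plan_outputs(src_dir, dst_dir):
    """Pair every .pdf in src_dir with a .txt path of the same stem in dst_dir.

    Hypothetical helper: it only computes paths, it does not run OCR.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)  # create the output directory if missing
    pairs = []
    for pdf in sorted(src.iterdir()):
        # Compare the extension case-insensitively so 'SCAN.PDF' is included too.
        if pdf.is_file() and pdf.suffix.lower() == '.pdf':
            pairs.append((str(pdf), str(dst / (pdf.stem + '.txt'))))
    return pairs
```

Each pair could then go to a worker that accepts a `(pdf_path, txt_path)` tuple instead of a single path, for example `p.imap_unordered(worker, plan_outputs('pdfs', 'texts'))`, where `worker` is a variant of `extract_txt` adapted to take that tuple.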

Edit 1:

The additional question is if it is possible to mark page boundaries. I think it is.

A method that would surely work is to split the PDF file into pages before the OCR. You could use e.g. pdfinfo from the poppler-utils package to find out the number of pages in a document. Then you could use e.g. pdfseparate from the same poppler-utils package to convert that one PDF file of N pages into N PDF files of one page each. You could then OCR the single-page PDF files separately. That would give you the text on each page separately.
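
A minimal sketch of that split-first approach, assuming poppler-utils is installed and `pdfinfo` and `pdfseparate` are on the PATH. The function names and the `page-%d.pdf` output pattern are made up for this example. The parsing is kept in its own function so it can be checked without running the tools.

```python
import re
import subprocess


def parse_page_count(pdfinfo_output):
    """Return the number after 'Pages:' in pdfinfo's text output."""
    match = re.search(r'^Pages:\s+(\d+)', pdfinfo_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError('no "Pages:" line found in pdfinfo output')
    return int(match.group(1))


def count_pages(pdf_path):
    """Run pdfinfo on the file and parse the page count from its output."""
    result = subprocess.run(['pdfinfo', pdf_path], check=True,
                            capture_output=True, text=True)
    return parse_page_count(result.stdout)


def split_pages(pdf_path, pattern='page-%d.pdf'):
    """Write one single-page PDF per page; %d is replaced by the page number."""
    subprocess.run(['pdfseparate', pdf_path, pattern], check=True)
```

After `split_pages('sample.pdf')`, the files `page-1.pdf` up to `page-N.pdf` (with N from `count_pages`) could each be passed to `textract.process` in turn.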

Alternatively you could OCR the whole document and then search for page breaks. This will only work if the document has a constant or predictable header or footer on every page. It is probably not as reliable as the abovementioned method.
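
As a rough illustration of this second approach, the sketch below splits whole-document OCR text on a footer line and then adds the page markers. The footer format `Page N of M` is purely an assumption for the example; a real document would need its own pattern, and OCR errors in the footer would cause missed page breaks.

```python
import re

# Assumed footer format, e.g. "Page 3 of 600" alone on a line (hypothetical).
FOOTER = re.compile(r'^[ \t]*Page \d+ of \d+[ \t]*$', flags=re.MULTILINE)


def split_on_footer(ocr_text):
    """Split whole-document OCR text into per-page chunks at each footer line."""
    chunks = FOOTER.split(ocr_text)
    # Drop chunks that are only whitespace and trim the surrounding newlines.
    return [chunk.strip('\n') for chunk in chunks if chunk.strip()]


def add_markers(pages):
    """Wrap each page in the begin/end markers used in this thread."""
    template = '\n<begin page pos = {0}>\n{1}\n<end page pos = {0}>\n'
    return ''.join(template.format(i, page) for i, page in enumerate(pages))
```

For example, `add_markers(split_on_footer(text))` returns a single string that can be written to the .txt file.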

Edit 2:

If you need a file, write a file:

import os

from PyPDF2 import PdfFileWriter, PdfFileReader  # names used by PyPDF2 before 3.0
import textract

def extract_text(pdf_file):
    txtfname = pdf_file[:-4] + '.txt'  # Assuming PDF file name ends with ".pdf"
    with open(pdf_file, 'rb') as pdf, open(txtfname, 'w', encoding='utf-8') as textfile:
        inputpdf = PdfFileReader(pdf)
        for i in range(inputpdf.numPages):
            # Write page i to a temporary single-page PDF.
            w = PdfFileWriter()
            w.addPage(inputpdf.getPage(i))
            pagefname = 'page{:03d}.pdf'.format(i)
            with open(pagefname, 'wb') as outfile:
                w.write(outfile)
            print('page', i)
            # textract returns bytes; decode before combining with str.
            text = textract.process(pagefname, method='tesseract').decode('utf-8')
            os.remove(pagefname)  # clean up.
            # Add header and footer.
            text = '\n<begin page pos = {}>\n'.format(i) + text + '\n<end page pos = {}>\n'.format(i)
            # Write the OCR-ed text to the output file.
            textfile.write(text)

101

answered Oct 16 '22 10:10

Roland Smith
