Extract text from pdf page by page and line by line with PyMuPDF

Question

I have to extract text from existing PDF documents. Currently I use the PyMuPDF module for this. Overall it works well and is very fast. The problem is that it replaces every horizontal tab in the PDF (for example, in headings such as "5 Topic") with a line feed. Since I have to extract the text line by line, this is very impractical for me.

Does anyone know how to fix this, or know another way to extract the text page by page and line by line? I also tried tika (which cannot extract the text page by page) and PyPDF2 (which is horrible: for any formatted text, such as bold, it inserts a line feed into the extracted text).

from io import StringIO

import fitz  # PyMuPDF

string_dic = {}
for document in pdfPath:  # pdfPath: an iterable of PDF file paths
    string_dic[document] = StringIO()
    pdf_file = fitz.open(document)
    number_of_pages = pdf_file.pageCount
    for page_number in range(number_of_pages):
        page = pdf_file.loadPage(page_number)
        page_content = page.getText("text")
        string_dic[document].write(page_content)
        string_dic[document].write(chr(12))  # form feed marks the page break
    string_dic[document].seek(0)

When I convert a PDF document with the following content:
5 text after a tab
I get the following result after the extraction:
5
text after a tab

Suvin K S · Accepted Answer

As per the documentation (see also flags here),

page.getText('text', flags=2)

should work. However, when I tried it, the output still contained a line break rather than a tab.

Another option is to get the text as a dictionary and walk through it to rebuild the lines yourself. It is a somewhat roundabout approach, but since you get the x0 and x1 position of each span, you can calculate the whitespace between spans and insert it where needed:

page.getText('dict')

Output

{'width': 612.0,
 'height': 792.0,
 'blocks': [{'type': 0,
   'bbox': (72.28006744384766,
    72.37419891357422,
    156.7176055908203,
    87.02263641357422),
   'lines': [{'wmode': 0,
     'dir': (1.0, 0.0),
     'bbox': (72.28006744384766,
      72.37419891357422,
      78.36209869384766,
      87.02263641357422),
     'spans': [{'size': 12.0,
       'flags': 4,
       'font': 'Calibri',
       'color': 0,
       'text': '5',
       'bbox': (72.28006744384766,
        72.37419891357422,
        78.36209869384766,
        87.02263641357422)}]},
    {'wmode': 0,
     'dir': (1.0, 0.0),
     'bbox': (108.28006744384766,
      72.37419891357422,
      156.7176055908203,
      87.02263641357422),
     'spans': [{'size': 12.0,
       'flags': 4,
       'font': 'Calibri',
       'color': 0,
       'text': 'SomeText',
       'bbox': (108.28006744384766,
        72.37419891357422,
        156.7176055908203,
        87.02263641357422)}]}]}]}
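
In the output above, '5' and 'SomeText' are reported as two separate lines, but their bounding boxes share the same top coordinate. A minimal sketch of the rebuilding step could group spans by vertical position and join them from left to right. The function name rebuild_lines and the parameters y_tolerance, min_gap and separator are my own choices for illustration, not part of PyMuPDF, and the threshold values are guesses that would need tuning for real documents:

```python
def rebuild_lines(page_dict, y_tolerance=2.0, min_gap=3.0, separator="\t"):
    """Merge spans that sit on the same visual row into one line of text.

    page_dict is expected to have the structure shown above, i.e. the
    result of page.getText('dict'). All thresholds are assumptions.
    """
    fragments = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # type 0 is a text block; skip others
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                text = span.get("text", "")
                if text.strip():
                    x0, y0, x1, _y1 = span["bbox"]
                    fragments.append((y0, x0, x1, text))

    # Sort top to bottom, then left to right.
    fragments.sort(key=lambda f: (f[0], f[1]))

    # Group fragments whose top coordinate is within y_tolerance
    # of the first fragment of the current row.
    rows = []
    for y0, x0, x1, text in fragments:
        if rows and abs(rows[-1][0] - y0) <= y_tolerance:
            rows[-1][1].append((x0, x1, text))
        else:
            rows.append([y0, [(x0, x1, text)]])

    result = []
    for _y, parts in rows:
        parts.sort(key=lambda p: p[0])
        pieces = [parts[0][2]]
        for prev, cur in zip(parts, parts[1:]):
            # Insert the separator only when there is a visible horizontal gap.
            if cur[0] - prev[1] > min_gap:
                pieces.append(separator)
            pieces.append(cur[2])
        result.append("".join(pieces))
    return result
```

Calling rebuild_lines(page.getText('dict')) for each page would then give one string per visual row, with a tab wherever two spans on the same row are separated by a gap wider than min_gap.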

Tags:

python

text-extraction

Rob2cc
