
Extract text from pdf page by page and line by line with PyMuPDF

I have to extract text from existing PDF documents. Currently I use the PyMuPDF module for this. Overall, it works fine and is very fast. The problem is that it replaces every horizontal tab in the PDF documents (for example, in headings: 5 \t Topic) with a line feed. Since I have to extract the text line by line, this is very impractical for me.

Does anyone know how to fix this problem, or know another way to extract the text page by page and line by line? I also tried tika (which can't extract the text page by page) and PyPDF2 (it's horrible: for any formatted text, such as bold, it inserts a line feed into the extracted text).

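import fitz  # PyMuPDF
from io import StringIO

# pdfPath is an iterable of PDF file paths, defined elsewhere.
string_dic = {}  # maps each document path to its extracted text
for document in pdfPath:
    string_dic[document] = StringIO()
    pdf_file = fitz.open(document)
    # Newer PyMuPDF releases name these page_count, load_page and get_text.
    number_of_pages = pdf_file.pageCount
    for page_number in range(number_of_pages):
        page = pdf_file.loadPage(page_number)
        page_content = page.getText("text")
        string_dic[document].write(page_content)
        string_dic[document].write(chr(12))  # form feed marks the page break
    pdf_file.close()
    string_dic[document].seek(0)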

When I convert a PDF document with the following content:
5 \t text after a tab
I get the following result after the extraction:
5
text after a tab

Rob2cc asked Nov 24 '25 19:11



1 Answer

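As per the documentation (see also the section on text extraction flags),

page.getText('text', flags=2)

should work, since flag value 2 (TEXT_PRESERVE_WHITESPACE) is meant to pass whitespace through instead of replacing it. However, when I tried it, the output still contained \n rather than \t.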

Another option is to get the text as a dictionary and walk through it to rebuild the text yourself. It is a somewhat roundabout way, but since you get the x0 and x1 positions of each span (the first and third values of its bbox), you can calculate the whitespace in between and use it.

page.getText('dict')

Output

{'width': 612.0,
 'height': 792.0,
 'blocks': [{'type': 0,
   'bbox': (72.28006744384766,
    72.37419891357422,
    156.7176055908203,
    87.02263641357422),
   'lines': [{'wmode': 0,
     'dir': (1.0, 0.0),
     'bbox': (72.28006744384766,
      72.37419891357422,
      78.36209869384766,
      87.02263641357422),
     'spans': [{'size': 12.0,
       'flags': 4,
       'font': 'Calibri',
       'color': 0,
       'text': '5',
       'bbox': (72.28006744384766,
        72.37419891357422,
        78.36209869384766,
        87.02263641357422)}]},
    {'wmode': 0,
     'dir': (1.0, 0.0),
     'bbox': (108.28006744384766,
      72.37419891357422,
      156.7176055908203,
      87.02263641357422),
     'spans': [{'size': 12.0,
       'flags': 4,
       'font': 'Calibri',
       'color': 0,
       'text': 'SomeText',
       'bbox': (108.28006744384766,
        72.37419891357422,
        156.7176055908203,
        87.02263641357422)}]}]}]}
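
In the output above, '5' and 'SomeText' come back as two separate lines that share the same vertical position, which is why the plain text extraction puts a line feed between them. A minimal sketch of how the pieces could be joined again, assuming the dictionary layout shown above: the helper name rebuild_lines and its parameters y_tolerance and separator are my own invention, not part of PyMuPDF, and the tolerance of 2 points is a guess that may need tuning for your documents.

```python
def rebuild_lines(page_dict, y_tolerance=2.0, separator="\t"):
    """Join text fragments that sit at (almost) the same height into one line.

    page_dict is assumed to have the layout returned by page.getText('dict').
    rebuild_lines, y_tolerance and separator are hypothetical names.
    """
    fragments = []  # (y0, x0, text) for every non-empty text line
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # type 0 is text; skip image blocks
            continue
        for line in block.get("lines", []):
            text = "".join(span["text"] for span in line["spans"])
            x0, y0, _x1, _y1 = line["bbox"]
            if text.strip():
                fragments.append((y0, x0, text))

    fragments.sort()  # top to bottom, then left to right
    rows = []  # each row is [reference_y, [(x0, text), ...]]
    for y0, x0, text in fragments:
        if rows and abs(y0 - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append((x0, text))  # same visual line
        else:
            rows.append([y0, [(x0, text)]])  # start a new visual line

    # Within each row, order the fragments by x0 and join them.
    return [separator.join(t for _, t in sorted(parts)) for _, parts in rows]


# Trimmed-down copy of the output above, keeping only the keys the helper reads.
sample = {
    "width": 612.0,
    "height": 792.0,
    "blocks": [{
        "type": 0,
        "lines": [
            {"bbox": (72.28, 72.37, 78.36, 87.02), "spans": [{"text": "5"}]},
            {"bbox": (108.28, 72.37, 156.72, 87.02), "spans": [{"text": "SomeText"}]},
        ],
    }],
}
print(rebuild_lines(sample))  # ['5\tSomeText']
```

With a real document you would call rebuild_lines(page.getText('dict')) for each page. Instead of a fixed tab you could also compare the x1 of one fragment with the x0 of the next and pick the separator based on the size of the gap.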
Suvin K S answered Nov 27 '25 09:11
