I have to extract text from existing PDF documents. Currently I use the PyMuPDF module for this. Overall, it works well and is very fast. The problem is that it replaces every horizontal tab in the PDF documents (for example, in headings: 5 \t Topic) with a line feed.
Since I have to extract the text line by line, this is very impractical for me.
Does anyone know how to fix this problem, or know another method to extract the text page by page and line by line? I also tried tika (where I can't extract the text page by page) and PyPDF2 (which is horrible: for any formatted text, such as bold, it puts a line feed into the extracted text).
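import fitz  # PyMuPDF
from io import StringIO

# Newer PyMuPDF releases name these page_count, load_page and get_text.
string_dic = {}  # maps each PDF path to a buffer holding its extracted text
for document in pdfPath:  # pdfPath: iterable of PDF file paths
    string_dic[document] = StringIO()
    pdf_file = fitz.open(document)
    number_of_pages = pdf_file.pageCount
    for page_number in range(number_of_pages):
        page = pdf_file.loadPage(page_number)
        page_content = page.getText("text")
        string_dic[document].write(page_content)
        string_dic[document].write(chr(12))  # form feed marks the page break
    string_dic[document].seek(0)  # rewind so the buffer can be read line by line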
When I convert a PDF document with the following content:
5 \t text after a tab
I get the following result after the extraction:
5
text after a tab
As per the documentation (see also flags here),
page.getText('text', flags=2)
should work. However, when I tried it, the output still contained \n rather than \t.
Another option is to get the text as a dictionary and walk through it to build the text yourself. It is a somewhat roundabout way, but since you get the x0 and x1 positions of each span (the first and third values of its bbox), you can calculate the whitespace in between and use it:
page.getText('dict')
Output
{'width': 612.0,
 'height': 792.0,
 'blocks': [{'type': 0,
             'bbox': (72.28006744384766,
                      72.37419891357422,
                      156.7176055908203,
                      87.02263641357422),
             'lines': [{'wmode': 0,
                        'dir': (1.0, 0.0),
                        'bbox': (72.28006744384766,
                                 72.37419891357422,
                                 78.36209869384766,
                                 87.02263641357422),
                        'spans': [{'size': 12.0,
                                   'flags': 4,
                                   'font': 'Calibri',
                                   'color': 0,
                                   'text': '5',
                                   'bbox': (72.28006744384766,
                                            72.37419891357422,
                                            78.36209869384766,
                                            87.02263641357422)}]},
                       {'wmode': 0,
                        'dir': (1.0, 0.0),
                        'bbox': (108.28006744384766,
                                 72.37419891357422,
                                 156.7176055908203,
                                 87.02263641357422),
                        'spans': [{'size': 12.0,
                                   'flags': 4,
                                   'font': 'Calibri',
                                   'color': 0,
                                   'text': 'SomeText',
                                   'bbox': (108.28006744384766,
                                            72.37419891357422,
                                            156.7176055908203,
                                            87.02263641357422)}]}]}]}
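In the dictionary above, '5' and 'SomeText' are reported as two separate lines, but their bboxes share the same top coordinate, so they sit on the same visual row. A minimal sketch of rebuilding such rows could look like the following. The function name dict_to_lines and the parameters y_tolerance and sep are made up for illustration and are not part of PyMuPDF; the 2-point tolerance is a guess that may need tuning per document, and the coordinates in the sample are rounded from the output above.

```python
def dict_to_lines(page_dict, y_tolerance=2.0, sep="\t"):
    """Merge text fragments that share (roughly) the same top y into one row.

    page_dict is expected to have the shape shown above, i.e. what
    page.getText('dict') returns. The helper itself is hypothetical.
    """
    fragments = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # only text blocks carry 'lines'
            continue
        for line in block["lines"]:
            text = "".join(span["text"] for span in line["spans"])
            x0, y0, _x1, _y1 = line["bbox"]
            fragments.append((y0, x0, text))

    fragments.sort()  # top to bottom, then left to right

    rows = []  # each entry: [reference_y, [(x0, text), ...]]
    for y0, x0, text in fragments:
        if rows and abs(y0 - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append((x0, text))  # same visual row
        else:
            rows.append([y0, [(x0, text)]])  # start a new row

    # Join the fragments of each row left to right with the separator.
    return [sep.join(t for _, t in sorted(parts)) for _, parts in rows]


# Sample input: the structure from the output above, coordinates rounded.
sample = {
    "width": 612.0,
    "height": 792.0,
    "blocks": [{
        "type": 0,
        "bbox": (72.28, 72.37, 156.72, 87.02),
        "lines": [
            {"bbox": (72.28, 72.37, 78.36, 87.02),
             "spans": [{"text": "5"}]},
            {"bbox": (108.28, 72.37, 156.72, 87.02),
             "spans": [{"text": "SomeText"}]},
        ],
    }],
}

lines = dict_to_lines(sample)
print(lines)
```

With a real document, the same function would be called as dict_to_lines(page.getText('dict')) for each page. Joining with a fixed tab is the simplest choice; the gap between one fragment's x1 and the next fragment's x0 could instead be used to decide between a space and a tab.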