How can I get the position of specific text in a PDF with xpdf or MuPDF?

I want to extract specific text from PDF files, together with the position of that text on the page.

I know that xpdf and MuPDF can parse PDF files, so I think they may help me accomplish this task.

But how do I use these two libraries to get the text position?

PDF1001 asked Dec 10 '25 14:12


1 Answer

If you don't mind using a Python binding for MuPDF, here is a Python solution using PyMuPDF (I am one of its developers):

import json                     # needed to parse the JSON output below
import fitz                     # the PyMuPDF module

doc = fitz.open("input.pdf")    # PDF input file
n = 0                           # page number (0-based); first page here
page = doc[n]

# Note: current PyMuPDF releases spell these calls get_text(...); older
# releases used getTextWords(), getTextBlocks() and getText().

# A list of all words on the page, each with its position info as
# (x0, y0, x1, y1, word, block_no, line_no, word_no):
wordlist = page.get_text("words")

# or, if you are only interested in blocks of lines belonging together:
blocklist = page.get_text("blocks")

# If you need yet more details, use a JSON-based output, which also gives
# images and their positions, as well as font information for the text.
tdict = json.loads(page.get_text("json"))
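
To get from the word list to the position of one specific piece of text, a small filter over the word tuples may be enough. The following is only a sketch, not part of PyMuPDF: the helper name find_word_positions and the sample data are made up for illustration, and it assumes each entry has the shape (x0, y0, x1, y1, word, block_no, line_no, word_no).

```python
def find_word_positions(words, target, ignore_case=True):
    """Return the (x0, y0, x1, y1) rectangle of every word equal to target.

    `words` is assumed to be a list of tuples shaped like PyMuPDF's word
    list: (x0, y0, x1, y1, word, block_no, line_no, word_no).
    This helper is hypothetical; it is not a PyMuPDF function.
    """
    def norm(s):
        return s.lower() if ignore_case else s

    wanted = norm(target)
    # Keep only the first four fields (the rectangle) of matching words.
    return [tuple(w[:4]) for w in words if norm(w[4]) == wanted]


# Hand-made sample data standing in for the word list of one page.
sample_words = [
    (72.0, 72.0, 110.5, 84.0, "Invoice", 0, 0, 0),
    (114.0, 72.0, 150.0, 84.0, "Total", 0, 0, 1),
    (72.0, 100.0, 108.0, 112.0, "total", 1, 0, 0),
]

hits = find_word_positions(sample_words, "total")
print(hits)  # [(114.0, 72.0, 150.0, 84.0), (72.0, 100.0, 108.0, 112.0)]
```

With a real document you would pass the page's word list instead of sample_words. Note that this matches whole words exactly, so a word with attached punctuation such as "total," is not found, and multi-word phrases need extra handling. Recent PyMuPDF releases also provide page.search_for("some text"), which returns the matching rectangles directly.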

We are on GitHub if you are interested.

Jorj McKie answered Dec 13 '25 19:12
