Is there any way to get the page numbers in a PDF of a search pattern?

Question

I have a PDF named test.pdf and I need to search for the text "My name" in that PDF.

Using this command, I can do the job:

pdftotext test.pdf - | grep 'My name'

Is there any way to get the number of the page on which the text "My name" appears, from the terminal itself?

rici · Accepted Answer

If you just want the linear page number (as opposed to the number which appears on the page), then you can do it by counting form-feed characters while you search for your text. pdftotext puts a form-feed at the end of every page, so the number of form-feeds prior to your text is one less than the (linear) page number the text is on. (Or thereabouts. Sometimes PDF files are not what they seem.)

Something like the following should work:

pdftotext test.pdf - |
awk -vRS=$'\f' -vNAME="My name" \
    'index($0,NAME){printf "%d: %s\n", NR, NAME;}'

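If you want to see the form-feed counting at work without a PDF at hand, you can feed the same awk program some synthetic text. This is only a minimal sketch: the function name find_page is made up for illustration, and the printf input merely imitates what pdftotext is assumed to emit (one form-feed at the end of each page).

```shell
# find_page NAME: hypothetical helper. Reads form-feed-separated text on
# stdin and prints "page: NAME" for every page that contains NAME.
# Note: awk applies escape processing to -v values, so a NAME containing
# backslashes would need extra care.
find_page() {
    awk -v RS='\f' -v NAME="$1" \
        'index($0, NAME) { printf "%d: %s\n", NR, NAME }'
}

# Three fake pages, each terminated by a form-feed.
printf 'intro\fHello, My name is Zam\fthe end\f' | find_page 'My name'
```

With a real file you would use something like pdftotext test.pdf - | find_page 'My name' instead of the printf line.
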
The following slightly more complicated solution will prove useful if you want to scan for more than one pattern. Unlike the simple solution above, this one will give you one line per pattern match, even if the same pattern matches twice on the same page:

pdftotext test.pdf - |
grep -F -o -e $'\f' -e 'My name' |
awk 'BEGIN{page=1} /\f/{++page;next} 1{printf "%d: %s\n", page, $0;}'

You can add as many patterns as you like to the grep command (by adding another -e string argument). The -F causes it to match exact strings, but that's not essential; you could use -E and a regex. The awk script assumes that all of the matches will either be a form-feed or a string that was matched, which is what you will get with the -o option to grep.
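
As a rough sketch of the multi-pattern variant with -E, the pipeline can be wrapped in a small function that turns each argument into a -e option. The name page_matches and the sample input are assumptions made for illustration, and the input stands in for pdftotext output. The -o option is assumed to be available (it is in GNU and BSD grep).

```shell
# page_matches PATTERN...: hypothetical helper. Reads form-feed-separated
# text on stdin and prints "page: matched text" once per match.
page_matches() {
    ff=$(printf '\f')       # a literal form-feed, without relying on $'\f'
    n=$#
    # Rewrite the argument list as: -e pat1 -e pat2 ...
    while [ "$n" -gt 0 ]; do
        set -- "$@" -e "$1"
        shift
        n=$((n - 1))
    done
    grep -E -o -e "$ff" "$@" |
    awk 'BEGIN { page = 1 } /\f/ { ++page; next } { printf "%d: %s\n", page, $0 }'
}

# Three fake pages; the second pattern is a regex, not a fixed string.
printf 'My name is A\nfoo\fbar My name, your name\fYour name\n\f' |
    page_matches 'My name' '[Yy]our name'
```

On this input the function should report "My name" on pages 1 and 2, "your name" on page 2 and "Your name" on page 3, one line per match.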

If you are looking for phrases, you should be aware that they might have line breaks (or even page breaks) in the middle. There's not a lot you can do about page breaks, but the first (pure awk) solution will handle line breaks if you replace the call to index with a regular expression match, and write the regular expression with [[:space:]]+ in place of every space in the original phrase:

pdftotext test.pdf - |
awk -vRS=$'\f' \
    '/My[[:space:]]+name/{printf "%d: %s\n", NR, "My name";}'

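If you would rather not write the regular expression by hand, the substitution can be done mechanically. The following is a sketch under some assumptions: phrase_to_regex and find_phrase_page are hypothetical names, the phrase is assumed to contain no regex metacharacters (nothing is escaped), and your awk is assumed to support POSIX character classes such as [[:space:]].

```shell
# phrase_to_regex PHRASE: hypothetical helper. Replaces each run of spaces
# with [[:space:]]+ so the phrase may be split across lines.
phrase_to_regex() {
    printf '%s\n' "$1" | sed 's/  */[[:space:]]+/g'
}

# find_phrase_page PHRASE: hypothetical helper. Reads form-feed-separated
# text on stdin and prints "page: PHRASE" for each page that matches.
find_phrase_page() {
    awk -v RS='\f' -v RE="$(phrase_to_regex "$1")" -v PHRASE="$1" \
        '$0 ~ RE { printf "%d: %s\n", NR, PHRASE }'
}

# The phrase is broken across two lines on the second fake page.
printf 'first page\fHello, My\nname is Zam\fthe end\f' | find_phrase_page 'My name'
```

As before, replace the printf line with pdftotext test.pdf - when working on a real file.
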
In theory, you could extract the visible page number (or "page label" as it is called), but many PDF files do not retain this metadata and you'd need a real PDF parser to extract it.

Tags:

linux

bash

terminal

pdf

pdf-generation

Zam
