I have a PDF named test.pdf and I need to search for the text "My name" in it.
This command does the job:
pdftotext test.pdf - | grep 'My name'
Is there any way to get the number of the page on which the text "My name" appears, from the terminal itself?
If you just want the linear page number (as opposed to the number which appears on the page), then you can do it by counting form-feed characters while you search for your text. pdftotext puts a form-feed at the end of every page, so the number of form-feeds prior to your text is one less than the (linear) page number the text is on. (Or thereabouts. Sometimes PDF files are not what they seem.)
Something like the following should work:
pdftotext test.pdf - |
awk -vRS=$'\f' -vNAME="My name" \
'index($0,NAME){printf "%d: %s\n", NR, NAME;}'
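The counting logic can be checked without a PDF by feeding awk some text that contains form-feeds. This is only a minimal sketch: find_pages is a hypothetical helper name, and it assumes an awk that interprets \f in a -v assignment (POSIX awks should), which avoids the bash-only $'\f' quoting.

```shell
# find_pages PHRASE: read pdftotext-style text on stdin and print
# "PAGE: PHRASE" for every page (form-feed-terminated record) containing PHRASE.
# (find_pages is an illustrative name, not part of pdftotext.)
find_pages() {
    awk -v RS='\f' -v NAME="$1" \
        'index($0, NAME) { printf "%d: %s\n", NR, NAME }'
}

# Simulated three-page document; the phrase sits on the second page.
printf 'page one\fHello, My name is Bob\fpage three\f' | find_pages 'My name'
# prints: 2: My name
```

With a real file, the same helper would be used as pdftotext test.pdf - | find_pages 'My name'.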
The following slightly more complicated solution will prove useful if you want to scan for more than one pattern. Unlike the simple solution above, this one will give you one line per pattern match, even if the same pattern matches twice on the same page:
pdftotext test.pdf - |
grep -F -o -e $'\f' -e 'My name' |
awk 'BEGIN{page=1} /\f/{++page;next} 1{printf "%d: %s\n", page, $0;}'
You can add as many patterns as you like to the grep command (by adding another -e string argument). The -F causes it to match exact strings, but that's not essential; you could use -E and a regex. The awk script assumes that all of the matches will either be a form-feed or a string that was matched, which is what you will get with the -o option to grep.
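As an illustration of adding a second pattern, here is a sketch that wraps the pipeline in a function so it can be tried on simulated input. The name scan_pages and the second pattern 'My address' are made up for the example, and the form-feed is produced with printf rather than $'\f' so that it should also work in shells other than bash.

```shell
# scan_pages: read pdftotext-style text on stdin and print one
# "PAGE: MATCH" line per match of either fixed string.
# (scan_pages and 'My address' are illustrative, not from the original.)
scan_pages() {
    ff=$(printf '\f')    # a literal form-feed character
    grep -F -o -e "$ff" -e 'My name' -e 'My address' |
    awk 'BEGIN { page = 1 }
         /\f/  { ++page; next }                 # form-feed: next page starts
               { printf "%d: %s\n", page, $0 }  # anything else is a match'
}

# Two simulated pages: two hits on page 1, one hit on page 2.
printf 'My name here\nMy name again\fMy address\f' | scan_pages
```

On a real file this would be run as pdftotext test.pdf - | scan_pages.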
If you are looking for phrases, you should be aware that they might have line breaks (or even page breaks) in the middle. There's not a lot you can do about page breaks, but the first (pure awk) solution will handle line breaks if you change the call to index to a regular expression search, and write the regular expression with [[:space:]]+ replacing every single space in the original phrase:
pdftotext test.pdf - |
awk -vRS=$'\f' \
'/My[[:space:]]+name/{printf "%d: %s\n", NR, "My name";}'
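If you don't want to write the regular expression by hand, awk can build it from the phrase. The following is a rough sketch under a few assumptions: find_phrase_pages is a hypothetical helper name, the phrase contains no regex metacharacters (they are not escaped here), and your awk supports POSIX character classes such as [[:space:]].

```shell
# find_phrase_pages PHRASE: like the index() version, but each run of
# spaces in PHRASE may match any whitespace, including a line break.
# (find_phrase_pages is an illustrative name; PHRASE is used as a regex,
# so metacharacters in it are NOT escaped.)
find_phrase_pages() {
    awk -v RS='\f' -v PHRASE="$1" '
        BEGIN { re = PHRASE; gsub(/ +/, "[[:space:]]+", re) }
        $0 ~ re { printf "%d: %s\n", NR, PHRASE }'
}

# Simulated input where the phrase is split across a line break on page 2.
printf 'cover\fMy\nname is here\f' | find_phrase_pages 'My name'
# prints: 2: My name
```

With a real file: pdftotext test.pdf - | find_phrase_pages 'My name'.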
In theory, you could extract the visible page number (or "page label" as it is called), but many PDF files do not retain this metadata and you'd need a real PDF parser to extract it.