Is there any way to extract highlighted text from a PDF file programmatically? Any language is welcome. I have found several libraries for Python, Java, and PHP, but none of them does the job.


How to extract Highlighted Parts from PDF files

2 Answers

To extract highlighted parts, you can use PyMuPDF. Here is an example that works with the sample file PDF-export-example-with-notes.pdf:

# Based on https://stackoverflow.com/a/62859169/562769

from typing import List, Tuple

import fitz  # install with 'pip install pymupdf'


def _parse_highlight(annot: fitz.Annot, wordlist: List[Tuple[float, float, float, float, str, int, int, int]]) -> str:
    # a highlight stores four vertices per quad (typically one quad per highlighted line)
    points = annot.vertices
    quad_count = len(points) // 4
    sentences = []
    for i in range(quad_count):
        # bounding rectangle of the i-th highlighted quad
        r = fitz.Quad(points[i * 4 : i * 4 + 4]).rect

        # keep every word whose bounding box intersects that rectangle
        words = [w for w in wordlist if fitz.Rect(w[:4]).intersects(r)]
        sentences.append(" ".join(w[4] for w in words))
    sentence = " ".join(sentences)
    return sentence


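def handle_page(page: fitz.Page) -> List[str]:
    # get_text / first_annot replace the camelCase names (getText / firstAnnot)
    # that newer PyMuPDF releases no longer provide
    wordlist = page.get_text("words")  # list of words on page
    wordlist.sort(key=lambda w: (w[3], w[0]))  # ascending y, then x

    highlights = []
    annot = page.first_annot
    while annot:
        if annot.type[0] == 8:  # 8 is the annotation type code for highlights
            highlights.append(_parse_highlight(annot, wordlist))
        annot = annot.next
    return highlights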


def main(filepath: str) -> List[str]:
    doc = fitz.open(filepath)

    highlights = []
    for page in doc:
        highlights += handle_page(page)

    return highlights


if __name__ == "__main__":
    print(main("PDF-export-example-with-notes.pdf"))
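
The core of `_parse_highlight` is geometric: every four vertices form one quad, and a word is kept when its bounding box meets the quad's bounding rectangle. Below is a minimal pure-Python sketch of that idea. It does not use PyMuPDF, the coordinates are made up, and the helper names (`quad_boxes`, `overlap_ratio`, `words_in_highlight`) and the `min_ratio` threshold are my own assumptions rather than part of the answer above. The threshold is one possible way to drop words from a neighbouring line that only graze the highlight, which a plain intersection test would include.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Word = Tuple[float, float, float, float, str]  # simplified: box plus word text


def quad_boxes(vertices: Sequence[Point]) -> List[Box]:
    """Group vertices four at a time and return each quad's bounding box."""
    boxes = []
    usable = len(vertices) - len(vertices) % 4  # ignore an incomplete trailing quad
    for i in range(0, usable, 4):
        xs = [p[0] for p in vertices[i : i + 4]]
        ys = [p[1] for p in vertices[i : i + 4]]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


def overlap_ratio(word: Box, box: Box) -> float:
    """Fraction of the word's area that lies inside the box."""
    w = min(word[2], box[2]) - max(word[0], box[0])
    h = min(word[3], box[3]) - max(word[1], box[1])
    area = (word[2] - word[0]) * (word[3] - word[1])
    if w <= 0 or h <= 0 or area <= 0:
        return 0.0
    return (w * h) / area


def words_in_highlight(
    vertices: Sequence[Point], words: Sequence[Word], min_ratio: float = 0.5
) -> str:
    """Join the words that are covered by at least min_ratio of any quad."""
    picked: List[str] = []
    for box in quad_boxes(vertices):
        picked += [w[4] for w in words if overlap_ratio(w[:4], box) >= min_ratio]
    return " ".join(picked)


# Made-up data: one highlighted line and a word on the line just below it.
vertices = [(10, 10), (110, 10), (10, 22), (110, 22)]
words = [
    (12, 11, 50, 21, "hello"),
    (55, 11, 100, 21, "world"),
    (12, 21, 50, 31, "below"),  # overlaps the quad by only 1 unit vertically
]
print(words_in_highlight(vertices, words))  # hello world
```

Lowering `min_ratio` towards zero makes the filter behave more like the plain `intersects` check in the answer, so "below" would be picked up as well.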


answered Sep 26 '22 02:09

Martin Thoma

OK, after some searching I found a way to export highlighted text from a PDF to a text file. It is not very hard:

1. First, highlight your text with whichever tool you like (in my case, I highlight while reading on an iPad using the GoodReader app).
2. Transfer the PDF to a computer and open it in Skim (a free PDF reader that is easy to find on the web).
3. In the File menu, choose Convert Notes and convert all the notes in your document to Skim notes.
4. That's all: go to Export and choose Export Skim Notes. This exports a list of your highlighted text. Once opened, this list can be exported again as a .txt file.

Not much work to do, and the result is fantastic.

answered Sep 24 '22 02:09

Angel

Tags: pdf, information-retrieval

user1183057
