How can I parse content from a PDF page with Swift?

The documentation is not really clear to me. As far as I can tell, I need to set up a CGPDFOperatorTable and then, for each PDF page, create a content stream with CGPDFContentStreamCreateWithPage and a scanner with CGPDFScannerCreate.

The documentation refers to setting up callbacks, but it's unclear to me how to do that. How do I actually obtain the content from a page?

This is my code so far.

    let pdfURL = NSBundle.mainBundle().URLForResource("titleofdocument", withExtension: "pdf")

    // Create pdf document
    let pdfDoc = CGPDFDocumentCreateWithURL(pdfURL)

    // Number of pages in this PDF
    let numberOfPages = CGPDFDocumentGetNumberOfPages(pdfDoc) as Int

    if numberOfPages <= 0 {
        // The number of pages is zero
        return
    }

    let myTable = CGPDFOperatorTableCreate()

    // Let's go through every page (PDF page numbers start at 1)
    for pageNr in 1...numberOfPages {

        let thisPage = CGPDFDocumentGetPage(pdfDoc, pageNr)
        let myContentStream = CGPDFContentStreamCreateWithPage(thisPage)
        let myScanner = CGPDFScannerCreate(myContentStream, myTable, nil)

        CGPDFScannerScan(myScanner)

        // Search for Content here?
        // ??

        CGPDFScannerRelease(myScanner)
        CGPDFContentStreamRelease(myContentStream)

    }

    // Release Table
    CGPDFOperatorTableRelease(myTable)

It's similar to the question "PDF Parsing with SWIFT", which has no answers yet.

0xT0mT0m asked Mar 20 '16 15:03


1 Answer

Here is an example of the callbacks implemented in Swift:

    let operatorTableRef = CGPDFOperatorTableCreate()

    CGPDFOperatorTableSetCallback(operatorTableRef, "BT") { (scanner, info) in
        print("Begin text object")
    }
    CGPDFOperatorTableSetCallback(operatorTableRef, "ET") { (scanner, info) in
        print("End text object")
    }
    CGPDFOperatorTableSetCallback(operatorTableRef, "Tf") { (scanner, info) in
        print("Select font")
    }
    CGPDFOperatorTableSetCallback(operatorTableRef, "Tj") { (scanner, info) in
        print("Show text")
    }
    CGPDFOperatorTableSetCallback(operatorTableRef, "TJ") { (scanner, info) in
        print("Show text, allowing individual glyph positioning")
    }

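    // pdfDocument is assumed to be a document opened beforehand,
    // e.g. with CGPDFDocumentCreateWithURL as in the question.
    let numPages = CGPDFDocumentGetNumberOfPages(pdfDocument)

    // 1..<(numPages + 1) is empty for a zero-page document,
    // whereas 1...numPages would trap at runtime.
    for pageNum in 1..<(numPages + 1) {
        let page = CGPDFDocumentGetPage(pdfDocument, pageNum)
        let stream = CGPDFContentStreamCreateWithPage(page)
        let scanner = CGPDFScannerCreate(stream, operatorTableRef, nil)
        // Scanning invokes the callbacks registered above for each matching operator
        CGPDFScannerScan(scanner)
        CGPDFScannerRelease(scanner)
        CGPDFContentStreamRelease(stream)
    }

    // Release the operator table once all pages have been scanned
    CGPDFOperatorTableRelease(operatorTableRef)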
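
The callbacks above only print a message. To actually get at the text, each callback can pop its operands off the scanner. The following is a minimal sketch in current Swift syntax rather than the Swift 2 style used above, so the document and page calls look different. `TextCollector`, `textCollector(from:)`, `appendString(_:to:)` and `extractText(from:)` are hypothetical names made up for this illustration. It also assumes that `CGPDFStringCopyTextString` decodes the strings sensibly, which tends to hold only for simple font encodings; PDFs with embedded or CID fonts may produce garbled output.

```swift
import Foundation
import CoreGraphics

// Hypothetical helper (not part of Core Graphics): accumulates decoded text.
final class TextCollector {
    var text = ""
}

// Recovers the collector from the `info` pointer handed to CGPDFScannerCreate.
func textCollector(from info: UnsafeMutableRawPointer?) -> TextCollector? {
    guard let info = info else { return nil }
    return Unmanaged<TextCollector>.fromOpaque(info).takeUnretainedValue()
}

// Decodes one PDF string operand and appends it to the collector.
// These helpers are top-level functions so the C callbacks below capture nothing.
func appendString(_ pdfString: CGPDFStringRef?, to info: UnsafeMutableRawPointer?) {
    guard let pdfString = pdfString,
          let decoded = CGPDFStringCopyTextString(pdfString) else { return }
    textCollector(from: info)?.text += decoded as String
}

func extractText(from url: URL) -> String? {
    guard let document = CGPDFDocument(url as CFURL),
          let table = CGPDFOperatorTableCreate() else { return nil }
    defer { CGPDFOperatorTableRelease(table) }

    // Tj takes a single string operand.
    CGPDFOperatorTableSetCallback(table, "Tj") { scanner, info in
        var operand: CGPDFStringRef?
        if CGPDFScannerPopString(scanner, &operand) {
            appendString(operand, to: info)
        }
    }

    // TJ takes an array that mixes strings with numeric position adjustments.
    CGPDFOperatorTableSetCallback(table, "TJ") { scanner, info in
        var operand: CGPDFArrayRef?
        guard CGPDFScannerPopArray(scanner, &operand), let items = operand else { return }
        for index in 0..<CGPDFArrayGetCount(items) {
            var element: CGPDFStringRef?
            // Returns false for the numeric entries, which are skipped here.
            if CGPDFArrayGetString(items, index, &element) {
                appendString(element, to: info)
            }
        }
    }

    // ET ends a text object; a newline is used as a crude separator (an assumption).
    CGPDFOperatorTableSetCallback(table, "ET") { _, info in
        textCollector(from: info)?.text += "\n"
    }

    let sink = TextCollector()
    let info = Unmanaged.passUnretained(sink).toOpaque()

    // stride yields nothing for a zero-page document, so no extra check is needed.
    for pageNumber in stride(from: 1, through: document.numberOfPages, by: 1) {
        guard let page = document.page(at: pageNumber) else { continue }
        let stream = CGPDFContentStreamCreateWithPage(page)
        let scanner = CGPDFScannerCreate(stream, table, info)
        _ = CGPDFScannerScan(scanner)
        CGPDFScannerRelease(scanner)
        CGPDFContentStreamRelease(stream)
    }
    return sink.text
}
```

Calling `extractText(from:)` with the URL of a bundled PDF should return the concatenated text, or `nil` if the document cannot be opened. The third argument of `CGPDFScannerCreate`, which is `nil` in the code above, is what carries the collector into the callbacks. Because the callbacks are C function pointers, they cannot capture local variables, which is why the state travels through that pointer instead.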

Matt Green answered Oct 11 '22 13:10