The documentation is not really clear to me. So far I reckon I need to set up a CGPDFOperatorTable and then create a CGPDFContentStreamCreateWithPage and CGPDFScannerCreate per PDF page.
The documentation refers to setting up Callbacks, but it's unclear to me how. How to actually obtain the content from a page?
This is my code so far.
let pdfURL = NSBundle.mainBundle().URLForResource("titleofdocument", withExtension: "pdf")
// Create pdf document
let pdfDoc = CGPDFDocumentCreateWithURL(pdfURL)
// Nr of pages in this PF
let numberOfPages = CGPDFDocumentGetNumberOfPages(pdfDoc) as Int
if numberOfPages <= 0 {
// The number of pages is zero
return
}
let myTable = CGPDFOperatorTableCreate()
// lets go through every page
for pageNr in 1...numberOfPages {
let thisPage = CGPDFDocumentGetPage(pdfDoc, pageNr)
let myContentStream = CGPDFContentStreamCreateWithPage(thisPage)
let myScanner = CGPDFScannerCreate(myContentStream, myTable, nil)
CGPDFScannerScan(myScanner)
// Search for Content here?
// ??
CGPDFScannerRelease(myScanner)
CGPDFContentStreamRelease(myContentStream)
}
// Release Table
CGPDFOperatorTableRelease(myTable)
It's a similar question to: PDF Parsing with SWIFT but has no answers yet.
Once you've opened the file, click on the "Edit" tab, and then click on the "edit" icon. Now you can right-click on the text and select "Copy" to extract the text you need.
A PDF Parser (also sometimes called PDF scraper) is a software that can be used to extract data from PDF documents. PDF Parsers can come in form of libraries for developers or as standalone software products for end-users. PDF Parsers are used mainly to extract data from a batch of PDF files.
Here is an example of the callbacks implemented in Swift:
let operatorTableRef = CGPDFOperatorTableCreate()
CGPDFOperatorTableSetCallback(operatorTableRef, "BT") { (scanner, info) in
print("Begin text object")
}
CGPDFOperatorTableSetCallback(operatorTableRef, "ET") { (scanner, info) in
print("End text object")
}
CGPDFOperatorTableSetCallback(operatorTableRef, "Tf") { (scanner, info) in
print("Select font")
}
CGPDFOperatorTableSetCallback(operatorTableRef, "Tj") { (scanner, info) in
print("Show text")
}
CGPDFOperatorTableSetCallback(operatorTableRef, "TJ") { (scanner, info) in
print("Show text, allowing individual glyph positioning")
}
let numPages = CGPDFDocumentGetNumberOfPages(pdfDocument)
for pageNum in 1...numPages {
let page = CGPDFDocumentGetPage(pdfDocument, pageNum)
let stream = CGPDFContentStreamCreateWithPage(page)
let scanner = CGPDFScannerCreate(stream, operatorTableRef, nil)
CGPDFScannerScan(scanner)
CGPDFScannerRelease(scanner)
CGPDFContentStreamRelease(stream)
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With