I don't understand type conversion. I know this isn't right; all I get is a bunch of hieroglyphs.

<pre class="prettyprint"><code>f, _ := os.Open("test.pdf")
defer f.Close()
io.Copy(os.Stdout, f)
</code></pre>

I want to work with the strings.


Extract words from PDF with golang?

3 Answers

That's because a PDF contains not only the text but also formatting information (fonts, padding, margins, positions, shapes, images).
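To see this for yourself, here is a minimal sketch that inspects the raw bytes instead of printing them. It checks for the `%PDF-` marker at the start of the file and counts occurrences of the `/FlateDecode` filter name, which is only a rough hint of how many streams are compressed. The helper names `isPDF` and `countCompressedStreams` are made up for this illustration, and the program uses only the standard library.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// isPDF reports whether data starts with the "%PDF-" marker.
// (Hypothetical helper for illustration; it does not validate the file.)
func isPDF(data []byte) bool {
	return bytes.HasPrefix(data, []byte("%PDF-"))
}

// countCompressedStreams counts occurrences of the /FlateDecode filter name.
// This is a rough indicator only, not a real PDF parser.
func countCompressedStreams(data []byte) int {
	return bytes.Count(data, []byte("/FlateDecode"))
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: inspect <file.pdf>")
		return
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("looks like a PDF:", isPDF(data))
	fmt.Println("FlateDecode occurrences:", countCompressedStreams(data))
}
```

If the count is non-zero, the text is probably stored in compressed streams, which is one reason copying the file to stdout shows unreadable output.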

If you need to read the plain text without formatting, I have forked a repository and implemented a function to do that. You can check it out at https://github.com/ledongthuc/pdf

I have also put together an example; I hope it is useful to you.


package main

import (
    "bytes"
    "fmt"

    "github.com/ledongthuc/pdf"
)

func main() {
    // Read a local PDF file and print its plain text.
    content, err := readPdf("test.pdf")
    if err != nil {
        panic(err)
    }
    fmt.Println(content)
}

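// readPdf returns the plain text of every page in the PDF at path.
// It uses the library API as shown in this answer; newer releases may
// differ (for example in what pdf.Open returns), so check the README.
func readPdf(path string) (string, error) {
    r, err := pdf.Open(path)
    if err != nil {
        return "", err
    }
    totalPage := r.NumPage()

    var textBuilder bytes.Buffer
    // Pages are numbered starting at 1.
    for pageIndex := 1; pageIndex <= totalPage; pageIndex++ {
        p := r.Page(pageIndex)
        // Skip pages that have no underlying object.
        if p.V.IsNull() {
            continue
        }
        textBuilder.WriteString(p.GetPlainText("\n"))
    }
    return textBuilder.String(), nil
}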
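Once readPdf has returned the text as a string, getting at the individual words needs only the standard library. The following is a rough sketch under the assumption that a "word" is a run of letters and digits; `extractWords` and `wordCounts` are hypothetical helper names, and the sample text stands in for the output of readPdf.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// extractWords splits text into words, treating every rune that is not a
// letter or a digit as a separator. (Hypothetical helper for illustration.)
func extractWords(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// wordCounts returns how often each word occurs, ignoring case.
func wordCounts(words []string) map[string]int {
	counts := make(map[string]int)
	for _, w := range words {
		counts[strings.ToLower(w)]++
	}
	return counts
}

func main() {
	// Stand-in for the string returned by readPdf.
	text := "Hello, PDF world!\nHello again."
	words := extractWords(text)
	fmt.Println(words)                      // prints [Hello PDF world Hello again]
	fmt.Println(wordCounts(words)["hello"]) // prints 2
}
```

Note that this simple rule also splits on apostrophes and hyphens, so "don't" becomes "don" and "t"; adjust the separator function if that matters for your text.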


answered Oct 12 '22 22:10

Le Dong Thuc

I tried several Go PDF libraries and found that sajari/docconv works the way I expect.

It is easy to use; here is an example:


package main

import (
    "fmt"
    "log"

    "code.sajari.com/docconv"
)

func main() {
    res, err := docconv.ConvertPath("your-file.pdf")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(res)
}

answered Oct 13 '22 00:10

Daoctor

<blockquote> all I get is a bunch of hieroglyphs. </blockquote>

What you get is the raw content of a PDF file, which is not plain text.

If you want to read a PDF file in Go, use one of the Go PDF libraries such as rsc.io/pdf or yob/pdfreader.

As mentioned here:

<blockquote> I doubt there is any 'solid framework' for this kind of stuff. PDF format isn't meant to be machine-friendly by design, and AFAIK there is no guaranteed way to parse arbitrary PDFs. </blockquote>

answered Oct 13 '22 00:10

VonC


Tags: pdf, go, text-analysis

omgj
