
Extract words from PDF with golang?

I don't understand type conversion. I know this isn't right; all I get is a bunch of hieroglyphs.

f, _ := os.Open("test.pdf")
defer f.Close()
io.Copy(os.Stdout, f)

I want to work with the strings....

omgj asked Oct 02 '16 04:10



3 Answers

That's because a PDF doesn't contain only the text; it also contains formatting information (fonts, padding, margins, positions, shapes, images).
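To make that concrete, here is a minimal sketch of how text sits inside a page's content stream: mixed in with font and positioning operators, with the visible string passed to the Tj ("show text") operator. The sample stream and the helper names (extractTj, unescape) are made up for illustration. It assumes an uncompressed stream with simple literal strings, which is not what most real PDFs look like, so treat it as a picture of the problem rather than a working extractor.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tjPattern matches a literal string operand, e.g. (Hello), followed by
// the Tj operator. Escaped characters such as \( and \) are allowed
// inside the string.
var tjPattern = regexp.MustCompile(`\(((?:\\.|[^\\()])*)\)\s*Tj`)

// unescape handles only the three simplest escapes; real PDF strings
// support more (octal codes, \n, \r, ...), which this sketch ignores.
var unescape = strings.NewReplacer(`\(`, "(", `\)`, ")", `\\`, `\`)

// extractTj returns the literal strings shown with Tj in an
// uncompressed content stream (hypothetical helper for this sketch).
func extractTj(stream string) []string {
	var out []string
	for _, m := range tjPattern.FindAllStringSubmatch(stream, -1) {
		out = append(out, unescape.Replace(m[1]))
	}
	return out
}

func main() {
	// A hand-written sample stream: BT/ET delimit a text object,
	// Tf selects a font, Td moves the text position.
	stream := `BT
/F1 12 Tf
72 720 Td
(Hello, PDF) Tj
0 -14 Td
(Second \(escaped\) line) Tj
ET`

	for _, s := range extractTj(stream) {
		fmt.Println(s)
	}
}
```

Everything except the two parenthesised strings is formatting, which is why a library has to interpret the stream instead of just copying it.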

If you need to read the plain text without formatting, I have forked a repository and implemented a function to do that. You can check it out at https://github.com/ledongthuc/pdf

I have also put together an example; I hope it is useful for you.

package main

import (
    "bytes"
    "fmt"

    "github.com/ledongthuc/pdf"
)

func main() {
    content, err := readPdf("test.pdf") // Read local pdf file
    if err != nil {
        panic(err)
    }
    fmt.Println(content)
}

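// readPdf returns the plain text of every page, without formatting.
// Note: pdf.Open and GetPlainText are called as in the original answer;
// their signatures may differ in other versions of the library.
func readPdf(path string) (string, error) {
    r, err := pdf.Open(path)
    if err != nil {
        return "", err
    }
    totalPage := r.NumPage()

    var textBuilder bytes.Buffer
    // Pages are numbered starting from 1.
    for pageIndex := 1; pageIndex <= totalPage; pageIndex++ {
        p := r.Page(pageIndex)
        if p.V.IsNull() {
            continue // skip missing pages
        }
        textBuilder.WriteString(p.GetPlainText("\n"))
    }
    return textBuilder.String(), nil
}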
Le Dong Thuc answered Oct 12 '22 22:10


I tried several Go PDF libraries and found that sajari/docconv works the way I expect.

It is easy to use. Here is an example:

package main

import (
    "fmt"
    "log"

    "code.sajari.com/docconv"
)

func main() {
    res, err := docconv.ConvertPath("your-file.pdf")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(res)
}
Daoctor answered Oct 13 '22 00:10


all I get is a bunch of hieroglyphs.

What you get is the raw content of a PDF file, which is not plain text.

If you want to read a PDF file in Go, use one of the Go PDF libraries, such as rsc.io/pdf or yob/pdfreader.

As mentioned here:

I doubt there is any 'solid framework' for this kind of stuff. PDF format isn't meant to be machine-friendly by design, and AFAIK there is no guaranteed way to parse arbitrary PDFs.

VonC answered Oct 13 '22 00:10