
How to extract text from pdf using golang?

I am trying to extract text from a PDF file in Go (see the code below). Instead of the text, it prints what looks like complete garbage (seemingly random numbers). Here is the PDF. I believe the text can be extracted, since I am able to copy and paste it from this file.

package main

import (
    "bufio"
    "bytes"
    "fmt"
    "io/ioutil"
    "log"
    "net/http"
    "os"
    "strings"
    pdf "github.com/unidoc/unipdf/v3/model"
)

func main() {
    fmt.Println("Enter URL of PDF file:")
    reader := bufio.NewReader(os.Stdin)
    url, err := reader.ReadString('\n')
    if err != nil {
        log.Fatal(err)
    }
    url = strings.TrimSpace(url)

    // Fetch PDF from URL.
    resp, err := http.Get(url)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
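    // Read the whole response body; do not ignore a failed or truncated download.
    buf, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Fatal(err)
    }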
    pdfReader, err := pdf.NewPdfReader(bytes.NewReader(buf))
    if err != nil {
        log.Fatal(err)
    }

    // Parse PDF file.
    isEncrypted, err := pdfReader.IsEncrypted()
    if err != nil {
        log.Fatal(err)
    }

    // If PDF is encrypted, exit with message.
    if isEncrypted {
        fmt.Println("Error: PDF is encrypted.")
        os.Exit(1)
    }

    // Get number of pages.
    numPages, err := pdfReader.GetNumPages()
    if err != nil {
        log.Fatal(err)
    }
    // Iterate through pages and print text.
    for i := 1; i <= numPages; i++ {
        page, err := pdfReader.GetPage(i)
        if err != nil {
            log.Fatal(err)
        }
        text, err := page.GetAllContentStreams()
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(text)
    }
}
simplfuzz asked Sep 02 '25 10:09


1 Answer

I could not find a free, capable Go package for extracting text from PDFs. Luckily, there are free CLI tools that can do this.

pdftotext, a tool that originated in Xpdf, is a promising choice. Here is its output:

$ pdftotext -layout -nopgbrk 2023-04-24_BU-12.pdf - | head
                           ALL INDIA TENNIS ASSOCIATION
                                        As on 24TH April , 2023
       BOY'S UNDER-12                                 2011                BEST    BEST    25% BEST POINTS
       24TH April , 2023                                                  Eight   Eight     Eight  CUT FOR     TTL.
                                                                          SING.   DBLS.     DBLS. NO SHOW      PTS.
RANK   NAME OF PLAYER                     REG NO.      DOB       STATE     PTS.   PTS.       PTS.  LATE WL    Final
  1    VIVAAN MIRDHA                      432735    08-Apr-11      (RJ)    485     565     141.25     0        797
  2    SMIT SACHIN UNDRE                  437763    07-Feb-11    (MH)      435     480       120      0      664.25
  3    RISHIKESH MANE                     436806    15-Jan-11    (MH)      420     380        95      0        619
  4    VIRAJ CHOUDHARY                    436648    03-Feb-11      (DL)    415     420       105      0      598.75

On Ubuntu, the tool is available in the poppler-utils package (Poppler's version of pdftotext), which can be installed with this command:

$ sudo apt install poppler-utils

It is easy to run it from a Go application with the os/exec package:

package main

import (
    "bytes"
    "context"
    "fmt"
    "os/exec"
)

func main() {
    // See "man pdftotext" for more options.
    args := []string{
        "-layout",              // Maintain (as best as possible) the original physical layout of the text.
        "-nopgbrk",             // Don't insert page breaks (form feed characters) between pages.
        "2023-04-24_BU-12.pdf", // The input file.
        "-",                    // Send the output to stdout.
    }
    cmd := exec.CommandContext(context.Background(), "pdftotext", args...)

    var buf bytes.Buffer
    cmd.Stdout = &buf

    if err := cmd.Run(); err != nil {
        fmt.Println(err)
        return
    }

    fmt.Println(buf.String())
}
Zeke Lu answered Sep 04 '25 01:09