I am trying to extract text from a PDF file in Go. See the code below. For some reason, it prints complete garbage (some random numbers). Here is the pdf. I believe it is possible to extract the text, since I am able to copy and paste it from this file.
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"os"
	"strings"

	pdf "github.com/unidoc/unipdf/v3/model"
)

func main() {
	fmt.Println("Enter URL of PDF file:")
	reader := bufio.NewReader(os.Stdin)
	url, err := reader.ReadString('\n')
	if err != nil {
		log.Fatal(err)
	}
	url = strings.TrimSpace(url)

	// Fetch PDF from URL.
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	buf, _ := ioutil.ReadAll(resp.Body)
	pdfReader, err := pdf.NewPdfReader(bytes.NewReader(buf))
	if err != nil {
		log.Fatal(err)
	}

	// Parse PDF file.
	isEncrypted, err := pdfReader.IsEncrypted()
	if err != nil {
		log.Fatal(err)
	}

	// If PDF is encrypted, exit with message.
	if isEncrypted {
		fmt.Println("Error: PDF is encrypted.")
		os.Exit(1)
	}

	// Get number of pages.
	numPages, err := pdfReader.GetNumPages()
	if err != nil {
		log.Fatal(err)
	}

	// Iterate through pages and print text.
	for i := 1; i <= numPages; i++ {
		page, err := pdfReader.GetPage(i)
		if err != nil {
			log.Fatal(err)
		}
		text, err := page.GetAllContentStreams()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(text)
	}
}
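The output looks like garbage because GetAllContentStreams returns the page's raw content streams (drawing operators and encoded glyph data), not decoded text. I could not find a free, capable Go package for extracting text from PDFs. Fortunately, there are free CLI tools that can do this.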
pdftotext from Xpdf is a promising choice. See its output:
$ pdftotext -layout -nopgbrk 2023-04-24_BU-12.pdf - | head
ALL INDIA TENNIS ASSOCIATION
As on 24TH April , 2023
BOY'S UNDER-12 2011 BEST BEST 25% BEST POINTS
24TH April , 2023 Eight Eight Eight CUT FOR TTL.
SING. DBLS. DBLS. NO SHOW PTS.
RANK NAME OF PLAYER REG NO. DOB STATE PTS. PTS. PTS. LATE WL Final
1 VIVAAN MIRDHA 432735 08-Apr-11 (RJ) 485 565 141.25 0 797
2 SMIT SACHIN UNDRE 437763 07-Feb-11 (MH) 435 480 120 0 664.25
3 RISHIKESH MANE 436806 15-Jan-11 (MH) 420 380 95 0 619
4 VIRAJ CHOUDHARY 436648 03-Feb-11 (DL) 415 420 105 0 598.75
On Ubuntu, this tool can be installed with the command below (the poppler-utils package ships Poppler's version of pdftotext, which is derived from Xpdf):
$ sudo apt install poppler-utils
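Before calling the tool from Go, it may help to confirm that it is actually installed. The following is a minimal sketch using exec.LookPath from the standard library; the helper name findTool is hypothetical and only used for illustration.

```go
package main

import (
	"fmt"
	"os/exec"
)

// findTool is a hypothetical helper: it looks for an executable with the
// given name in the directories listed in PATH and returns where it was found.
func findTool(name string) (string, error) {
	path, err := exec.LookPath(name)
	if err != nil {
		return "", fmt.Errorf("%q not found on PATH: %w", name, err)
	}
	return path, nil
}

func main() {
	path, err := findTool("pdftotext")
	if err != nil {
		// On Ubuntu the tool comes with the poppler-utils package.
		fmt.Println(err)
		return
	}
	fmt.Println("using", path)
}
```

Failing early with a clear message is usually easier to debug than an error from cmd.Run later on.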
It is also easy to run it from a Go application with the os/exec package:
package main

import (
	"bytes"
	"context"
	"fmt"
	"os/exec"
)

func main() {
	// See "man pdftotext" for more options.
	args := []string{
		"-layout",              // Maintain (as best as possible) the original physical layout of the text.
		"-nopgbrk",             // Don't insert page breaks (form feed characters) between pages.
		"2023-04-24_BU-12.pdf", // The input file.
		"-",                    // Send the output to stdout.
	}
	cmd := exec.CommandContext(context.Background(), "pdftotext", args...)

	var buf bytes.Buffer
	cmd.Stdout = &buf

	if err := cmd.Run(); err != nil {
		fmt.Println(err)
		return
	}

	fmt.Println(buf.String())
}
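Since the question fetches the PDF from a URL, the two steps could be combined roughly as follows. This is only a sketch under a few assumptions: the PDF is saved to a temporary file first (so it does not depend on pdftotext reading from stdin), the 60-second timeout is an arbitrary choice, and the helper names pdftotextArgs, download, extractText and run are made up for this example.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"net/http"
	"os"
	"os/exec"
	"time"
)

// pdftotextArgs builds the same argument list as the example above.
func pdftotextArgs(input string) []string {
	return []string{"-layout", "-nopgbrk", input, "-"}
}

// download saves the body served at url to a temporary file and returns
// its path. The caller is responsible for removing the file.
func download(ctx context.Context, url string) (string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected HTTP status: %s", resp.Status)
	}
	f, err := os.CreateTemp("", "download-*.pdf")
	if err != nil {
		return "", err
	}
	_, copyErr := io.Copy(f, resp.Body)
	closeErr := f.Close()
	if copyErr != nil || closeErr != nil {
		os.Remove(f.Name())
		return "", fmt.Errorf("saving PDF failed: copy: %v, close: %v", copyErr, closeErr)
	}
	return f.Name(), nil
}

// extractText runs pdftotext on pdfPath and returns what it wrote to stdout.
// Anything the tool printed to stderr is included in the returned error.
func extractText(ctx context.Context, pdfPath string) (string, error) {
	var stdout, stderr bytes.Buffer
	cmd := exec.CommandContext(ctx, "pdftotext", pdftotextArgs(pdfPath)...)
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return "", fmt.Errorf("pdftotext: %w: %s", err, stderr.String())
	}
	return stdout.String(), nil
}

// run downloads the PDF, extracts its text, and cleans up the temporary file.
func run(url string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()

	path, err := download(ctx, url)
	if err != nil {
		return err
	}
	defer os.Remove(path)

	text, err := extractText(ctx, path)
	if err != nil {
		return err
	}
	fmt.Print(text)
	return nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: pass the URL of a PDF file as the only argument")
		return
	}
	if err := run(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

Capturing stderr separately means that a message from pdftotext (for example about a damaged file) shows up in the Go error instead of being lost.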