
Convert .pdf to .txt

Tags: r, pdf, tm

This problem is not new on Stack Overflow, but I am pretty sure I am missing something obvious.

I am trying to convert a few .pdf files into .txt files in order to mine their text. I based my approach on this excellent script. The text in the .pdf files does not consist of images, so no OCR is required.

# Load tm package
library(tm)

# The folder containing my PDFs
dest <- "./pdfs"

# xpdf installed from http://www.foolabs.com/xpdf/download.html;
# check that its command-line tools are found on the PATH
file.exists(Sys.which(c("pdfinfo", "pdftotext")))
# [1] TRUE TRUE

# Make a vector of PDF file names
myfiles <- list.files(path = dest, pattern = "\\.pdf$", full.names = TRUE)

# Remove white spaces from the PDFs' names
sapply(myfiles, FUN = function(i) {
  file.rename(from = i, to = file.path(dirname(i), gsub(" ", "", basename(i))))
})

# List the files again so the vector holds the renamed paths
myfiles <- list.files(path = dest, pattern = "\\.pdf$", full.names = TRUE)

# Call pdftotext on each PDF
lapply(myfiles, function(i) system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"',
                                         paste0('"', i, '"')), wait = FALSE))

It should create a .txt copy of every .pdf file in the dest folder. I checked for problems with the path, for white spaces in the path, and for common xpdf installation issues, but nothing happens.
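One way to narrow down why nothing happens is to run each command synchronously and look at its exit status instead of launching it with wait = FALSE. The following is only a minimal sketch, assuming pdftotext is found by Sys.which() and accepts an input PDF followed by an output text file; pdftotext_args and convert_pdf are hypothetical helper names, not part of tm or xpdf.

```r
# Hypothetical helper: build the argument vector for one pdftotext call.
# The output path is the input path with .pdf swapped for .txt.
pdftotext_args <- function(pdf) {
  txt <- sub("\\.pdf$", ".txt", pdf, ignore.case = TRUE)
  c(shQuote(pdf), shQuote(txt))
}

# Hypothetical helper: run pdftotext on one file and report the exit status.
convert_pdf <- function(pdf, exe = Sys.which("pdftotext")) {
  if (!nzchar(exe)) stop("pdftotext not found on the PATH")
  # system2() waits for the command by default, so the status is available
  status <- system2(exe, args = pdftotext_args(pdf))
  if (status != 0) warning("pdftotext returned status ", status, " for ", pdf)
  invisible(status)
}

# Usage (assumes dest is the folder defined earlier):
# myfiles <- list.files(dest, pattern = "\\.pdf$", full.names = TRUE)
# statuses <- vapply(myfiles, convert_pdf, integer(1))
```

A non-zero status or a warning from system2() would at least show whether the executable is being reached and whether it rejects a particular file.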

Here is the repository I am working on. If it is useful, I can paste the output of sessionInfo(). Thanks in advance.

Worice asked Nov 08 '22 13:11



1 Answer

Late answer, but I recently discovered that with current versions of tm (0.7-4) you can read PDFs directly into a corpus if you have pdftools installed (install.packages("pdftools")).

library(tm)

directory <- getwd() # change this to the directory where the PDF files are located

# read the PDFs with readPDF; the default engine is pdftools, see ?readPDF for more info
# pattern is a regular expression, so escape the dot and anchor it at the end
my_corpus <- VCorpus(DirSource(directory, pattern = "\\.pdf$"),
                     readerControl = list(reader = readPDF))
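If the .txt files themselves are still wanted, pdftools can also be called directly, without going through a corpus. This is only a rough sketch, assuming pdftools is installed and the PDFs contain extractable text; txt_path and pdf_to_txt are names made up for illustration.

```r
# Hypothetical helper: derive the .txt path that sits next to a PDF.
txt_path <- function(pdf) sub("\\.pdf$", ".txt", pdf, ignore.case = TRUE)

# Hypothetical helper: extract the text of one PDF and save it as .txt.
pdf_to_txt <- function(pdf) {
  if (!requireNamespace("pdftools", quietly = TRUE)) {
    stop("install.packages(\"pdftools\") is needed for this sketch")
  }
  pages <- pdftools::pdf_text(pdf)  # character vector, one element per page
  writeLines(pages, txt_path(pdf))
  invisible(txt_path(pdf))
}

# Usage (assumes directory is the folder used above):
# pdfs <- list.files(directory, pattern = "\\.pdf$", full.names = TRUE)
# invisible(lapply(pdfs, pdf_to_txt))
```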
phiver answered Nov 15 '22 06:11
