The problem is not new on Stackoverflow, but I am pretty sure I am missing something obvious.
I am trying to convert a few .pdf files into .txt files, in order to mine their text. I based my approach on this excellent script. The text in the .pdf files is not composed by images, hence no OCR required.
# Load tm package
library(tm)
# The folder containing my PDFs
dest <- "./pdfs"
# Correctly installed xpdf from http://www.foolabs.com/xpdf/download.html
file.exists(Sys.which(c("pdfinfo", "pdftotext")))
[1] TRUE TRUE
# Delete white spaces from pdfs' names
sapply(myfiles, FUN = function(i){
file.rename(from = i, to = paste0(dirname(i), "/", gsub(" ", "", basename(i))))
})
# make a vector of PDF file names
myfiles <- list.files(path = dest, pattern = "pdf", full.names = TRUE)
lapply(myfiles, function(i) system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"',
paste0('"', i, '"')), wait = FALSE))
It should create a .txt copy of any .pdf file in the dest
folder. I checked for issues with the path, for white spaces in the path, for xpdf common installation issues but nothing happens.
Here is the repository I am working on. If it can be useful, I can paste the SessionInfo
. Thanks in advance.
Late answer:
But I recently discovered that with the current verions of tm (0.7-4) you can read pdfs directly into a corpus if you have pdftools installed (install.packages("pdftools")
).
library(tm)
directory <- getwd() # change this to directory where pdf-files are located
# read the pdfs with readPDF, default engine used is pdftools see ?readPDF for more info
my_corpus <- VCorpus(DirSource(directory, pattern = ".pdf"),
readerControl = list(reader = readPDF))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With