I'm running a for loop over 13K PDF files: it reads each file, pre-processes the text, finds similarities and writes the results to a txt file. However, when I run the loop it throws the following error:
Error in poppler_pdf_text(loadfile(pdf), opw, upw) : Not enough space
What can be the reason?
I have already checked memory.limit(), so memory is not the issue. I also removed the Thumbs.db file from the folder, but the same error appears again.
folder_path <- "C: ...."
## get vector with all pdf file names
pdf_folder <- list.files(folder_path)
## loop over all pdf documents
for (s in seq_along(pdf_folder)) {
  ## choose one pdf document from the vector of file names
  pdf_document_name <- pdf_folder[s]
  ## read the pdf document into a data.frame
  pdf <- textreadr::read_pdf(paste0(folder_path, "/", pdf_document_name))
  print(s)
  rm(pdf)
} ## end of for loop
# Error:
Error in poppler_pdf_text(loadfile(pdf), opw, upw) : Not enough space
The expected outcome is that all PDF documents in the folder are read.
I was able to reproduce this error with the following:
large_pdf_file <- "path/to/file.pdf"
system.time(test <- textreadr::read_pdf(large_pdf_file))
# user system elapsed
# 165.64 0.42 166.17
dim(test)
# [1] 519871 3
The problem is a possible memory leak in the poppler library, which is used by the pdftools package: the Task Manager shows a huge increase in RAM while textreadr::read_pdf reads a large image-based PDF file. This is resolved by updating the pdftools package to v2.3.1.
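To check which pdftools version is installed and pull the current release from CRAN (a generic sketch; adjust if you install packages another way):

packageVersion("pdftools")    # should report 2.3.1 or later after updating
install.packages("pdftools")  # installs the latest CRAN release
library(pdftools)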
If you insist on using an older version of pdftools, some users have reported success with this workaround - however, I tried it using the same large pdf file as before and received this error:
pdf <- callr::r(function() {
  ## run the read in a separate R subprocess so leaked memory is released when it exits
  textreadr::read_pdf("filename.pdf")
})
Error in value[[3L]](cond) :
callr subprocess failed: could not start R, exited with non-zero status,
has crashed or was killed
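If you have to stay on an older pdftools, another option (a sketch I have not tested on a 13K-file corpus; the folder path is a placeholder) is to wrap each read in tryCatch so that one problematic file does not abort the whole loop, and to force garbage collection between iterations:

folder_path <- "C:/path/to/pdfs"                        # placeholder path
pdf_folder  <- list.files(folder_path, pattern = "\\.pdf$")

for (s in seq_along(pdf_folder)) {
  pdf_path <- file.path(folder_path, pdf_folder[s])
  pdf <- tryCatch(
    textreadr::read_pdf(pdf_path),
    error = function(e) {
      message("Skipping ", pdf_path, ": ", conditionMessage(e))
      NULL
    }
  )
  if (!is.null(pdf)) {
    ## ... pre-process text, compute similarities, write txt ...
  }
  rm(pdf)
  gc()   # release memory before the next file
  print(s)
}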