
"Not enough space" error when running a for loop over 13K PDF documents

I'm running a for loop over 13K PDF files. Each iteration reads a file, pre-processes the text, finds similarities, and writes the result to a txt file. However, when I run the loop it fails with this error:

Error in poppler_pdf_text(loadfile(pdf), opw, upw) : Not enough space

What can be the reason?

  1. I tried increasing memory.limit(), but that did not help.
  2. I tried deleting hidden files in the folder, such as Thumbs.db, but the same error appears.
  3. I remove the pdf object at the end of every iteration.

library(textreadr)  ## assumed source of read_pdf()

folder_path <- "C: ...."
## get vector with all pdf names
pdf_folder <- list.files(folder_path)

## for loop over all pdf documents
for(s in seq_along(pdf_folder)){

   ## choose one pdf document from vector of strings
   pdf_document_name <- pdf_folder[s]

   ## read pdf_document pdf into data.frame
   pdf <- read_pdf(file.path(folder_path, pdf_document_name))

   print(s)

   ## drop the parsed document before the next iteration
   rm(pdf)

} ## end of for loop

# Error: 

Error in poppler_pdf_text(loadfile(pdf), opw, upw) : Not enough space

The expected outcome is to read all pdf documents in the original path.

Bakai Baiazbekov asked Jul 12 '19 17:07


1 Answer

I was able to reproduce this error with the following:

  • Image-based PDF (16,702 pages, 161,277 KB)
  • R v3.5.3 64-bit
  • textreadr v0.90
  • pdftools v2.2
  • tesseract v4.0
  • Windows 10 64-bit
  • 16 GB RAM

This is resolved by updating the pdftools package to v2.3.1.

large_pdf_file <- "path/to/file.pdf"

system.time(test <- textreadr::read_pdf(large_pdf_file))
#    user  system elapsed
#  165.64    0.42  166.17

dim(test)
# [1] 519871      3

The likely cause is a memory leak in the poppler library, which the pdftools package uses.

Task Manager shows a large increase in RAM usage while textreadr::read_pdf reads a large image-based PDF file.

If you need to stay on an older version of pdftools, some users have reported success with the following workaround, which runs the read in a separate R process via callr. However, when I tried it with the same large PDF file as before, I received this error:

pdf <- callr::r(function(){
    textreadr::read_pdf('filename.pdf')
})
   
Error in value[[3L]](cond) : 
  callr subprocess failed: could not start R, exited with non-zero status,
has crashed or was killed
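Whichever reader is used, a 13K-file batch can be made tolerant of individual failures so that one bad file does not stop the whole run. The sketch below is an illustration rather than a fix for the leak: `read_all` is a hypothetical helper (not part of textreadr, pdftools, or callr) that wraps each read in tryCatch and records NULL for any file that errors.

```r
# Hypothetical helper: apply `reader` to every path, logging failures
# instead of aborting the loop. Failed files are stored as NULL.
read_all <- function(paths, reader) {
  results <- vector("list", length(paths))
  names(results) <- paths
  for (i in seq_along(paths)) {
    value <- tryCatch(
      reader(paths[i]),
      error = function(e) {
        message("Failed on ", paths[i], ": ", conditionMessage(e))
        NULL
      }
    )
    # Assign via list() so a NULL result keeps its slot in `results`
    results[i] <- list(value)
  }
  results
}

# Usage sketch (assumes textreadr and callr are installed and that
# `pdf_paths` is a character vector of full file paths):
# docs <- read_all(pdf_paths, function(p) {
#   callr::r(function(path) textreadr::read_pdf(path), args = list(path = p))
# })
# failed <- names(docs)[vapply(docs, is.null, logical(1))]
```

Collecting the failed paths this way makes it possible to retry only those files after updating pdftools.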
Andrew answered Oct 31 '22 01:10