Not enough space Error appears when running for loop for 13K pdf documents

Question

I'm running a for loop over 13K pdf files. Each iteration reads a file, pre-processes the text, computes similarities, and writes the result to a txt file. However, when I run the loop it gives this error:

Error in poppler_pdf_text(loadfile(pdf), opw, upw) : Not enough space

What can be the reason?

I tried increasing the memory limit with memory.limit(), but that did not help.
I tried deleting hidden files in the folder, such as Thumbs.db, but the same error appears.
I remove the pdf object (rm(pdf)) at the end of every iteration.


library(textreadr)

folder_path <- "C: ...."
## get vector with all pdf names
pdf_folder <- list.files(folder_path)

## for loop over all pdf documents
for (s in seq_along(pdf_folder)) {

   ## choose one pdf document from vector of strings
   pdf_document_name <- pdf_folder[s]

   ## read pdf_document pdf into data.frame
   pdf <- read_pdf(paste0(folder_path, "/", pdf_document_name))

   ## print the index of the file that was just read
   print(s)

   rm(pdf)

} ## end of for loop

# Error: 

Error in poppler_pdf_text(loadfile(pdf), opw, upw) : Not enough space

The expected outcome is to read all pdf documents in the original path.

Andrew · Accepted Answer

I was able to reproduce this error with the following:

Image-based pdf (16,702 pages, 161,277 KB)
R v3.5.3 64-bit
textreadr v0.90
pdftools v2.2
tesseract v4.0
Windows 10 64-bit
16 GB RAM

This is resolved by updating the pdftools package to v2.3.1. With the updated package, the same file is read without error:

large_pdf_file <- "path/to/file.pdf"

system.time(test <- textreadr::read_pdf(large_pdf_file))
#    user  system elapsed
#  165.64    0.42  166.17

dim(test)
# [1] 519871      3
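
To see whether an installed pdftools predates the version mentioned above, a check along these lines may help. This is only a sketch: `fixed_version` and `needs_update` are names made up for illustration, and it assumes that comparing version numbers is a good enough test.

```r
# Version reported in this answer as no longer showing the error.
fixed_version <- "2.3.1"

# Hypothetical helper: TRUE when `installed` is older than `fixed`.
needs_update <- function(installed, fixed = fixed_version) {
  package_version(installed) < package_version(fixed)
}

if (requireNamespace("pdftools", quietly = TRUE)) {
  installed <- as.character(utils::packageVersion("pdftools"))
  if (needs_update(installed)) {
    message("pdftools ", installed, " is older than ", fixed_version,
            "; consider running install.packages(\"pdftools\")")
  } else {
    message("pdftools ", installed, " is at least ", fixed_version)
  }
} else {
  message("pdftools is not installed")
}
```

Restart the R session after updating so the new version is the one that gets loaded.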

The likely cause is a memory leak in the poppler library, which the pdftools package uses.

Task Manager shows RAM usage climbing sharply while textreadr::read_pdf reads a large image-based pdf file.

If you insist on using an older version of pdftools, some users have reported success with the workaround below, which runs the read in a separate R process through callr. However, when I tried it with the same large pdf file as before, it failed with the error shown after the code:

pdf <- callr::r(function(){
    textreadr::read_pdf('filename.pdf')
})
   
Error in value[[3L]](cond) : 
  callr subprocess failed: could not start R, exited with non-zero status,
has crashed or was killed
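
For a batch of 13K files, it may also be worth keeping a single failing file from aborting the whole loop. The sketch below wraps each read in tryCatch and collects the paths that failed. It does not address the leak itself, and `try_each` and its `reader` argument are hypothetical names introduced here, not part of textreadr or pdftools.

```r
# Call `reader` on every path; return the paths for which it raised an error.
# The value returned by `reader` is deliberately not kept, so nothing
# accumulates in memory across iterations.
try_each <- function(paths, reader) {
  failed <- character(0)
  for (path in paths) {
    ok <- tryCatch({
      reader(path)
      TRUE
    }, error = function(e) {
      message("Skipping ", path, ": ", conditionMessage(e))
      FALSE
    })
    if (!ok) failed <- c(failed, path)
  }
  failed
}

# Possible usage, assuming textreadr is installed and folder_path is set:
# pdf_paths <- list.files(folder_path, pattern = "\\.pdf$", full.names = TRUE)
# failed <- try_each(pdf_paths, textreadr::read_pdf)
```

In practice `reader` would be a function that reads one pdf, processes it, and writes the txt output, so that the files listed in `failed` can be retried after the package update.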

Tags: r, batch-processing

Bakai Baiazbekov
