Extract Text from Two-Column PDF with R

Tags: r, pdf, pdftools
I have a lot of PDFs which are in two-column format. I am using the pdftools package in R. Is there a way to read each PDF according to the two-column format without cropping each PDF individually?

Each PDF consists of selectable text, and the pdf_text function has no problem reading it. The only issue is that it reads the first line of the first column and then continues with the same line of the second column, instead of moving down the first column.

Thank you very much in advance for your help.

tsouchlarakis asked Mar 01 '17 20:03


1 Answer

I had the same problem. What I did was find, for each page of my PDFs, the character positions at which a space occurs most often across the lines, and store them in a vector. Then I sliced every line at that position.

library(pdftools)

# Path to the PDF to read (left empty here; set it to your own file).
src <- ""

# Remove leading and trailing whitespace.
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

QTD_COLUMNS <- 2

# Split the lines of one page into columns.
# `text` is a character vector of lines, all padded to the same width.
read_text <- function(text) {
  result <- character(0)
  # Get the positions of every " " on every line of the page.
  lstops <- gregexpr(pattern = " ", text)
  # Put the positions where " " occurs most often in a vector, most frequent first.
  stops <- as.integer(names(sort(table(unlist(lstops)), decreasing = TRUE)[1:2]))
  # Slice based on the specified number of columns (this can be improved).
  for (i in seq_len(QTD_COLUMNS)) {
    temp_result <- sapply(text, function(x) {
      first <- 1
      last <- stops[i]
      if (i > 1)
        first <- stops[i - 1] + 1
      if (i == QTD_COLUMNS) # Last column: read until the end of the line.
        last <- nchar(x) + 1
      substr(x, start = first, stop = last)
    }, USE.NAMES = FALSE)
    temp_result <- trim(temp_result)
    result <- append(result, temp_result)
  }
  result
}

txt <- pdf_text(src)
result <- character(0)
for (i in seq_along(txt)) {
  page <- txt[i]
  # Split the page into lines and pad them to a common width.
  t1 <- unlist(strsplit(page, "\n"))
  maxSize <- max(nchar(t1))
  t1 <- paste0(t1, strrep(" ", maxSize - nchar(t1)))
  result <- append(result, read_text(t1))
}
result
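
To see the heuristic in isolation, here is a minimal sketch that runs on a small in-memory sample instead of a real PDF. The function name `split_two_columns` and the sample lines are made up for illustration. It is a variation on the idea rather than the exact code above: it counts spaces per character position with a character matrix instead of `gregexpr`, and it takes the first position with the highest count as the cut point. It assumes the gutter between the two columns is where spaces occur most often, which will not hold for every layout.

```r
# Hypothetical helper: split fixed-width lines into two columns at the
# character position where a space occurs on the most lines.
split_two_columns <- function(lines) {
  # Pad every line to the width of the longest one.
  width <- max(nchar(lines))
  padded <- paste0(lines, strrep(" ", width - nchar(lines)))

  # One row per line, one column per character position.
  chars <- do.call(rbind, strsplit(padded, ""))

  # Number of lines that have a space at each position.
  space_count <- colSums(chars == " ")

  # First position with the highest count (assumed to lie in the gutter).
  cut <- which.max(space_count)

  list(
    cut   = cut,
    left  = trimws(substr(padded, 1, cut)),
    right = trimws(substr(padded, cut + 1, width))
  )
}

# Made-up sample page: left column padded to 20 characters, then the right column.
left_text  <- c("Alpha beta gamma", "one two three", "lorem ipsum")
right_text <- c("delta epsilon", "four five", "dolor sit amet")
sample_lines <- sprintf("%-20s%s", left_text, right_text)

cols <- split_two_columns(sample_lines)

# Reading order: the whole left column first, then the right column.
c(cols$left, cols$right)
```

If a page has a word boundary that lines up on more lines than the gutter does, the cut lands inside the left column, so it is worth checking `cols$cut` on a few pages before processing many files.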
Felipe Santiago answered Sep 24 '22 08:09