Extract Text from Two-Column PDF with R

I have a lot of PDFs which are in two-column format. I am using the pdftools package in R. Is there a way to read each PDF according to the two-column format without cropping each PDF individually?

Each PDF consists of selectable text, and the pdf_text function has no problem reading it. The only issue is that it reads the first line of the first column and then continues with the same line of the second column, instead of moving down the first column.

Thank you very much in advance for your help.

asked Mar 01 '17 20:03

tsouchlarakis

1 Answer

I had the same problem. What I did was find the character positions that are most frequently a space on each page of my PDFs and store them in a vector. Then I sliced every line of the page at those positions.

library(pdftools)
src <- ""  # path to the PDF file

QTD_COLUMNS <- 2
read_text <- function(text) {
  result <- character(0)
  # Find the positions of every " " in each line of the page.
  lstops <- gregexpr(pattern = " ", text)
  # Keep the most frequent space positions (one cut per column gap),
  # sorted from left to right.
  freq <- sort(table(unlist(lstops)), decreasing = TRUE)
  stops <- sort(as.integer(names(freq[seq_len(QTD_COLUMNS - 1)])))
  # Slice each line into the specified number of columns (this can be improved).
  for (i in seq_len(QTD_COLUMNS)) {
    start <- if (i > 1) stops[i - 1] + 1 else 1
    # The last column reads until the end of the line.
    end <- if (i == QTD_COLUMNS) nchar(text) else stops[i]
    result <- append(result, trimws(substr(text, start, end)))
  }
  result
}

txt <- pdf_text(src)
result <- character(0)
for (i in seq_along(txt)) {
  page <- txt[i]
  t1 <- unlist(strsplit(page, "\n"))
  if (length(t1) == 0) next  # skip blank pages
  # Pad every line to the same width so short lines also count as
  # having spaces in the gap between the columns.
  maxSize <- max(nchar(t1))
  t1 <- paste0(t1, strrep(" ", maxSize - nchar(t1)))
  result <- append(result, read_text(t1))
}
result
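
To try the idea without a PDF at hand, here is a minimal sketch on made-up lines. It only illustrates the "cut at the most frequent space position" approach described above. The helper name split_two_columns() and the sample text are hypothetical (neither comes from pdftools), and it assumes exactly two columns with a gap that is blank on every line.

```r
# Hypothetical helper (not part of pdftools): split lines at the character
# position that is a space on the largest number of lines.
split_two_columns <- function(lines) {
  # Pad every line to the same width so short lines count as spaces too.
  width <- max(nchar(lines))
  padded <- paste0(lines, strrep(" ", width - nchar(lines)))
  # Collect the position of every space on every line.
  space_pos <- unlist(gregexpr(" ", padded, fixed = TRUE))
  space_pos <- space_pos[space_pos > 0]  # gregexpr() gives -1 when no match
  # Pick the position that is a space most often (first one on ties).
  counts <- table(space_pos)
  cut <- as.integer(names(counts)[which.max(counts)])
  # Left column first, then right column, each trimmed.
  c(trimws(substr(padded, 1, cut)),
    trimws(substr(padded, cut + 1, width)))
}

# Made-up page text, laid out the way a two-column page might come back
# from pdf_text() after splitting on "\n".
left  <- c("alpha beta", "gamma rays", "epsilon")
right <- c("one two", "three", "four five")
page_lines <- paste0(sprintf("%-15s", left), right)

split_two_columns(page_lines)
```

On this sample the call returns the three left-column entries followed by the three right-column entries. A heading that spans both columns, or a space that lines up inside one column on every line, would likely throw the cut off, so real pages may need extra handling.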

answered Sep 24 '22 08:09

Felipe Santiago

Tags: r, pdf, pdftools
