I have a lot of PDFs in two-column format, and I am using the pdftools package in R. Is there a way to read each PDF column by column without cropping each one individually?
Each PDF consists of selectable text, and the pdf_text function has no problem reading it. The only issue is that it reads the first line of the first column and then continues into the second column on the same line, instead of moving down the first column.
Thank you very much in advance for your help.
I had the same problem. What I did was find, for each page of my PDFs, the character positions where a space occurs most often across the lines, and store them in a vector. Then I sliced every line at those positions.
library(pdftools)

src <- ""  # path to the PDF file
trim <- function(x) gsub("^\\s+|\\s+$", "", x)
QTD_COLUMNS <- 2

# Split the lines of one page (all padded to the same width) into columns.
read_text <- function(text) {
  result <- character(0)
  # Positions of every " " in every line of the page.
  lstops <- gregexpr(pattern = " ", text)
  # Positions where a space occurs most often, most frequent first.
  stops <- as.integer(names(sort(table(unlist(lstops)), decreasing = TRUE)[1:2]))
  # Slice each line at those positions, one column at a time (this can be improved).
  for (i in seq_len(QTD_COLUMNS)) {
    temp_result <- sapply(text, function(x) {
      from <- 1
      to <- stops[i]
      if (i > 1)
        from <- stops[i - 1] + 1
      if (i == QTD_COLUMNS)  # last column: read to the end of the line
        to <- nchar(x)
      substr(x, start = from, stop = to)
    }, USE.NAMES = FALSE)
    result <- append(result, trim(temp_result))
  }
  result
}

txt <- pdf_text(src)  # one string per page
result <- character(0)
for (i in seq_along(txt)) {
  # Split the page into lines and pad them to equal width so positions line up.
  t1 <- unlist(strsplit(txt[i], "\n"))
  maxSize <- max(nchar(t1))
  t1 <- paste0(t1, strrep(" ", maxSize - nchar(t1)))
  result <- append(result, read_text(t1))
}
result
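If you want to check the idea without a PDF at hand, here is a minimal sketch on a made-up page. The sample text, the 14-character left column and the helper name split_two_columns are assumptions for illustration only. It counts spaces per character position with colSums rather than gregexpr and table, but the principle is the same: the position where most lines have a space is taken as the gutter between the columns. Ties go to the leftmost position.

```r
# Made-up two-column page, laid out the way pdf_text() returns a page:
# each string is one visual line spanning both columns.
left  <- c("alpha beta", "ga mma", "delta eps", "z")
right <- c("one two", "three", "four five", "six")
page_lines <- sprintf("%-14s%s", left, right)

# Hypothetical helper: split equal-layout lines at the most common space position.
split_two_columns <- function(lines) {
  # Pad every line to the same width so character positions line up.
  width  <- max(nchar(lines))
  padded <- paste0(lines, strrep(" ", width - nchar(lines)))

  # One row per line, one column per character position.
  chars <- do.call(rbind, strsplit(padded, ""))

  # Number of lines that have a space at each position; the maximum is
  # assumed to be the gutter (which.max returns the leftmost maximum).
  space_count <- colSums(chars == " ")
  gutter <- which.max(space_count)

  # Everything up to the gutter is column 1, the rest is column 2.
  col1 <- trimws(substr(padded, 1, gutter))
  col2 <- trimws(substr(padded, gutter + 1, width))

  list(gutter = gutter, text = c(col1, col2))
}

out <- split_two_columns(page_lines)
out$gutter  # 11
out$text    # left column top to bottom, then right column top to bottom
```

On real pages the most frequent space is not guaranteed to be the gutter (a common word boundary or the trailing padding can win), so it is worth printing the chosen position for a few pages before trusting the result.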