Use R to convert PDF files to text files for text mining

I have nearly one thousand PDF journal articles in a folder, and I need to text mine the abstracts of all of them. This is what I am doing now:

dest <- "~/A1.pdf"

# set the path to pdftotext.exe and convert the PDF to text
exe <- "C:/Program Files (x86)/xpdfbin-win-3.03/bin32/pdftotext.exe"
system(paste0('"', exe, '" "', dest, '"'), wait = FALSE)

# get the txt-file name (escape the dot so it matches literally) and open it
filetxt <- sub("\\.pdf$", ".txt", dest)
shell.exec(filetxt)

This converts one PDF into one .txt file; I then copy the abstract into another .txt file and compile everything by hand, which is tedious.

How can I read all of the articles in the folder and convert each one into a .txt file that contains only its abstract? It should be possible by limiting the content to the text between ABSTRACT and INTRODUCTION in each article, but I have not been able to do so. Any help is appreciated.

asked Jan 30 '14 by S Das



1 Answer

Yes, as IShouldBuyABoat notes this is not really an R question, but it is something R can do with only minor contortions...

Use R to convert PDF files to txt files...

# folder with 1000s of PDFs
dest <- "C:\\Users\\Desktop"

# make a vector of the PDF file names (anchor the pattern so only .pdf files match)
myfiles <- list.files(path = dest, pattern = "\\.pdf$", full.names = TRUE)

# convert each PDF file that is named in the vector into a text file 
# text file is created in the same directory as the PDFs
# note that my pdftotext.exe is in a different location to yours
lapply(myfiles, function(i) system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"', 
             paste0('"', i, '"')), wait = FALSE) )

Extract only abstracts from txt files...

# if you just want the abstracts, we can use regex to extract that part of
# each txt file. This assumes that the abstract always sits between the
# words 'Abstract' and 'Introduction'
mytxtfiles <- list.files(path = dest, pattern = "\\.txt$", full.names = TRUE)
abstracts <- lapply(mytxtfiles, function(i) {
  j <- paste0(scan(i, what = character()), collapse = " ")
  regmatches(j, gregexpr("(?<=Abstract).*?(?=Introduction)", j, perl=TRUE))
})
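To see what the lookbehind/lookahead pair is doing, here is a quick check on a toy string (my own illustrative example, not from the original answer):

# the regex captures everything between 'Abstract' and 'Introduction',
# non-greedily, without including the marker words themselves
j <- "Title Abstract We study X in Y. Introduction Prior work..."
regmatches(j, gregexpr("(?<=Abstract).*?(?=Introduction)", j, perl = TRUE))
# [[1]] " We study X in Y. "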

Write abstracts into separate txt files...

# write the abstracts as txt files
# (or use them in the list for whatever you want to do next)
lapply(seq_along(abstracts), function(i)
  write.table(abstracts[i],
              file = paste(mytxtfiles[i], "abstract", "txt", sep = "."),
              quote = FALSE, row.names = FALSE, col.names = FALSE, eol = " "))

And now you're ready to do some text mining on the abstracts.
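For example, if the mining is going to happen in R anyway, the list can go straight into a corpus; a minimal sketch, assuming the tm package is installed (the package choice is mine, not the original answer's):

# flatten the list of matches and build a tm corpus for text mining
library(tm)
corpus <- Corpus(VectorSource(unlist(abstracts)))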

answered Sep 24 '22 by Ben