I'm trying to read a folder of PDF files into a data frame in R. I'm able to read individual PDF files using the pdftools library and pdf_text(filepath).
Ideally, I could grab the author and title of each PDF in the folder and push them into a data frame with a column for each, so that I can then use basic tidytext functions on the text.
For a single file right now, I can just use:
library(pdftools)
library(tidytext)
library(dplyr)
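# pdf_text() returns a character vector with one element per page
txt <- pdf_text("filepath")
# tibble() replaces the deprecated data_frame()
txt <- tibble(txt = txt)
txt %>%
  unnest_tokens(word, txt)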
Here I have a dataframe with single words. I'd like to get to a dataframe where I have articles unpacked including a title and author column.
To find all the PDFs within a working directory, you can use list.files with the pattern argument:
all_pdfs <- list.files(pattern = "\\.pdf$")
The all_pdfs object will then be a character vector that contains all your filenames.
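The pattern argument is a regular expression, so an unescaped dot matches any character. A small sketch of the difference, using made-up filenames (these names are purely illustrative):

```r
# Hypothetical filenames, only for illustration
files <- c("report.pdf", "notes.txt", "mypdf", "scan.PDF")

# Unescaped dot: "." matches any character, so "mypdf" also matches
unescaped <- grepl(".pdf$", files)

# Escaped dot: only names that end in a literal ".pdf" match
escaped <- grepl("\\.pdf$", files)

# Case-insensitive variant, which also picks up "scan.PDF"
any_case <- grepl("\\.pdf$", files, ignore.case = TRUE)

print(data.frame(files, unescaped, escaped, any_case))
```

list.files() also accepts ignore.case = TRUE if some of your files use an uppercase extension.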
Then, you can set up a pipe to read in all the PDFs and unnest them using tidytext with a map function from purrr. You can use a mutate() within the map() to annotate each line with the filename, if you'd like.
library(pdftools)
library(tidyverse)
library(tidytext)
# Read each PDF, tag its rows with the filename, then split into one word per row
map_df(all_pdfs, ~ tibble(txt = pdf_text(.x)) %>%
         mutate(filename = .x) %>%
         unnest_tokens(word, txt))
You'll need to do some fancier work to get a title and author column, depending on where you have that information. Maybe with a regex on txt or filename before unnesting?
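As one possible sketch of the filename route: the code below assumes, purely for illustration, that your files are named like "Author - Title.pdf". That naming convention and the helper name parse_pdf_names() are assumptions and do not come from pdftools or tidytext. If your PDFs carry embedded metadata instead, pdf_info() from pdftools may expose Title and Author under its keys element, though many PDFs leave those fields empty.

```r
# Hypothetical helper: split "Author - Title.pdf" into author and title columns.
# Assumes the FIRST " - " separates the author from the title.
parse_pdf_names <- function(filename) {
  # Drop any directory part and the .pdf extension
  stem <- sub("\\.pdf$", "", basename(filename), ignore.case = TRUE)
  has_sep <- grepl(" - ", stem, fixed = TRUE)
  # Everything before the first " - " is treated as the author
  author <- ifelse(has_sep, sub(" - .*$", "", stem), NA_character_)
  # Everything after the first " - " is the title; with no separator, use the whole stem
  title <- ifelse(has_sep, sub("^.*? - ", "", stem, perl = TRUE), stem)
  data.frame(filename, author, title, stringsAsFactors = FALSE)
}

meta <- parse_pdf_names(c("Smith - A Title - Part 2.pdf", "untitled.pdf"))
print(meta)
```

The resulting data frame has one row per file, so it could be joined onto the unnested words by the filename column, for example with dplyr's left_join().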