tidytext read files from folder

Tags: r, nlp, tidytext

I'm trying to read a folder of PDF files into a dataframe in R. I can read individual PDF files using the pdftools library and pdf_text(filepath).

Ideally, I could grab the author and title of each PDF in the folder and push them into a dataframe with a column for each, so that I can then use basic tidytext functions on the text.

For a single file right now, I can just use:

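library(pdftools)
library(tidytext)
library(dplyr)

# pdf_text() returns a character vector with one element per page
txt <- pdf_text("filepath")
# tibble() replaces the deprecated data_frame()
txt <- tibble(txt = txt)
# split the page text into one word per row
txt %>%
     unnest_tokens(word, txt)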

Here I have a dataframe of single words. I'd like to get to a dataframe where the articles are unpacked and there are title and author columns.

asked Mar 08 '23 23:03 by jfkoehler


1 Answer

To find all the PDFs in the working directory, you can use list.files() with a pattern argument:

# escape the dot so it matches a literal "." rather than any character
all_pdfs <- list.files(pattern = "\\.pdf$")

The all_pdfs object will then be a character vector that contains all your filenames.
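
If the PDFs sit in a folder other than the working directory, a sketch along these lines may help. The folder name pdf_dir is hypothetical, and a temporary directory with empty placeholder files stands in for it so the example runs on its own. The path, full.names, and ignore.case arguments are part of base R's list.files().

```r
# Hypothetical folder holding the PDFs; a temporary directory stands in for it here.
pdf_dir <- file.path(tempdir(), "pdf_demo")
dir.create(pdf_dir, showWarnings = FALSE)

# Empty placeholder files, only so the listing below has something to find.
file.create(file.path(pdf_dir, c("a.pdf", "b.PDF", "notes.txt")))

all_pdfs <- list.files(
  path = pdf_dir,          # look in this folder instead of the working directory
  pattern = "\\.pdf$",     # names ending in a literal ".pdf"
  full.names = TRUE,       # return paths that pdf_text() can open directly
  ignore.case = TRUE       # also pick up ".PDF"
)

basename(all_pdfs)
```

With full.names = TRUE the vector holds full paths, so basename() is handy if you only want the bare filename in a column later on.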

Then you can set up a pipe that reads in each PDF and unnests it with tidytext, using a map function from purrr. If you'd like, a mutate() inside the map() call can label each row with the filename it came from.

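library(pdftools)
library(tidyverse)
library(tidytext)

# For each file: one row per page, tagged with its filename, then one row per word.
# tibble() replaces the deprecated data_frame().
map_df(all_pdfs, ~ tibble(txt = pdf_text(.x)) %>%
    mutate(filename = .x) %>%
    unnest_tokens(word, txt))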

You'll need to do some fancier work to get title and author columns, depending on where that information lives. Maybe with a regex on txt or filename before unnesting?
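
As a rough sketch of the filename route, suppose (purely as an assumption, since the question doesn't say how the files are named) that each file follows the pattern "Author - Title.pdf". The helper parse_pdf_name() below is hypothetical and uses only base R. Files that don't match the pattern get NA for author and keep the bare name as the title.

```r
# Assumed (hypothetical) naming scheme: "<author> - <title>.pdf"
parse_pdf_name <- function(path) {
  # drop the folder and the .pdf extension
  stem <- sub("\\.pdf$", "", basename(path), ignore.case = TRUE)
  # does the name contain the " - " separator at all?
  has_sep <- grepl(" - ", stem, fixed = TRUE)
  data.frame(
    filename = basename(path),
    # everything before the first " - "
    author = ifelse(has_sep, sub(" - .*$", "", stem), NA_character_),
    # everything after the first " - " (lazy match, so titles may contain " - ")
    title = ifelse(has_sep, sub("^.*? - ", "", stem, perl = TRUE), stem),
    stringsAsFactors = FALSE
  )
}

meta <- parse_pdf_name(c("Austen - Pride and Prejudice.pdf", "untitled.pdf"))
meta
```

The resulting table has one row per file, so it could be joined onto the unnested words by the filename column, for example with dplyr's left_join(). If the author and title only appear inside the document text, the regex would have to run on txt instead, and the pattern would depend on how the articles are laid out.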

answered Mar 11 '23 14:03 by Julia Silge