I am trying to parse a PDF by reading it into R as an HTML/XML file. I am aware that I could read it in using the pdftools
package. However, when I read the link as an HTML/XML file, I have been unable to access the data inside.
library(xml2)
library(XML)
html_string <- "https://mchb.hrsa.gov/whusa11/hstat/hsrmh/downloads/pdf/233ml.pdf"
ht <- read_html(html_string)
nodes <- xml_find_all(ht, ".//body")
> ht
{xml_document}
<html>
[1] <body><p>%PDF-1.6\r%\xe2ãÏÓ\r\n83 0 obj\r<>stream\r\nhÞ\u009cTË\u008eÓ@äSú'»çÑ3\u0096V+EA\\"V«$·\u ...
[2] <html><p>\u009d@a ö¯\u0088Î÷Ü\\&ÔÈýÐâÿZO^"j[FoQ)ÒÇq\n\u009b\u008dx\u0085\u008eß±µ\u009bõo\t\u008f6¢ ...
> ht[1]
$node
<pointer: 0x00000000047901a0>
I tried the following functions as well, without success:
xmlTreeParse
xmlToList
xmlParse
How do I access the XML document content string inside? I am trying to turn it into objects that I can manipulate.
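For what it's worth, the pdftools route mentioned above is the most direct way to get at the text, since the `%PDF-1.6` stream you see in the output is compressed PDF data, not markup. A minimal sketch, assuming the goal is just the page text:

```r
# minimal pdftools sketch: pdf_text() returns one character string per page
library(pdftools)

download.file(
  "https://mchb.hrsa.gov/whusa11/hstat/hsrmh/downloads/pdf/233ml.pdf",
  "233ml.pdf", mode = "wb"   # binary mode, or the PDF gets corrupted on Windows
)
pages <- pdf_text("233ml.pdf")
length(pages)                   # number of pages
cat(substr(pages[1], 1, 200))   # peek at the first page's text
```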
A possible solution using pdfx:
# download the file to your working directory
download.file("https://mchb.hrsa.gov/whusa11/hstat/hsrmh/downloads/pdf/233ml.pdf","233ml.pdf")
# get packages
library(remotes)
remotes::install_github("sckott/extractr")
library(extractr)
#parse
pdfx(file="233ml.pdf", what="parsed")
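Note that `pdfx()` sends the file to the PDFX web service, so it needs a network connection. The shape of the returned object is easiest to discover by inspecting it before drilling in (a sketch; the component names are whatever extractr returns, not something to rely on sight unseen):

```r
# inspect the parsed result before pulling pieces out of it
res <- pdfx(file = "233ml.pdf", what = "parsed")
str(res, max.level = 1)   # list the top-level components extractr returns
```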
Your xml_document ht includes one body node and 13 html nodes. You can use html_node or html_nodes from rvest to extract the pieces you need.
library(xml2)
library(XML)
library(rvest)
library(dplyr)
html_string="https://mchb.hrsa.gov/whusa11/hstat/hsrmh/downloads/pdf/233ml.pdf"
ht <-read_html(html_string)
ht %>% html_nodes("html") # look at all html nodes
ht %>% html_node("body") # look at body node
According to your question, it looks like you would like to have the body node as text, right?
You can get it with:
ht %>% html_node("body") %>% as.character -> text #get body node as text
text
[1] "<body><p>%PDF-1.6\r%\xe2ãÏÓ\r\n83 0 obj\r<&g...
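If what you actually want is the stream contents without the wrapping tags, rvest's html_text() strips them instead of serializing them the way as.character does (a sketch; the variable name body_text is just illustrative):

```r
# html_text() drops the <body>/<p> wrapper tags and returns only the text content
ht %>% html_node("body") %>% html_text() -> body_text
substr(body_text, 1, 60)   # peek at the start of the raw PDF stream
```

Keep in mind this is still the compressed PDF byte stream rendered as text, so for readable content the pdftools approach is usually the better fit.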