
How to read the nth line of parsed HTML in R

The readLines function seems to return all the content of the source page as a single line.

con = url("target_url_here")
htmlcode = readLines(con)

The readLines function appears to have concatenated all the lines of the source page into one line, so I see no way to navigate to the 15th line of the original HTML source page.

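My next approach was to fetch the page with the httr package and parse it with the XML package.

library("httr")  # provides GET() and content()
library("XML")   # provides htmlParse(), xpathSApply() and xmlValue()
html <- GET("target_url_here")
content2 = content(html,as="text")
parsedHtml = htmlParse(content2,asText=TRUE)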

Printing parsedHtml shows that it retains the HTML formatting and displays all the content as it appears in the source page. Now suppose I want to extract the title. The call

xpathSApply(parsedHtml,"//title",xmlValue)

will give the title.

But my question is: how do I navigate to a particular line, say the 15th line, of the HTML? In other words, how can I treat the HTML as a vector of strings, where each element of the vector is a separate line of the HTML page or parsed HTML object?

Asked by Novneet Nov on Aug 17 '14 at 07:08


1 Answer

A closer look at the docs for readLines() shows that it actually returns:

A character vector of length the number of lines read.

So in your case:

con = url("http://example.com/file_to_parse.html")
htmlCode = readLines(con)

you can simply use htmlCode[15] to access the 15th line of the original HTML source page.
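As a rough illustration of the point above, here is a minimal sketch that avoids the network by writing a small, made-up HTML page to a temporary file and reading it back. The page contents and the variable names (sample_page, html_lines, split_lines) are assumptions for the example only, not part of the original question. It also shows one possible way to get the same kind of vector when the page is already held as a single string, as with the text returned by httr's content(..., as = "text").

```r
# Hypothetical page source, one element per line (made up for this example)
sample_page <- c("<html>", "<head>", "<title>Example</title>", "</head>",
                 "<body>", "<p>Hello</p>", "</body>", "</html>")

# Write it to a temporary file so readLines() has something to read
tmp <- tempfile(fileext = ".html")
writeLines(sample_page, tmp)

# readLines() returns a character vector with one element per line
html_lines <- readLines(tmp)
length(html_lines)   # number of lines read
html_lines[3]        # "<title>Example</title>"

# If the page is held as one string instead, splitting on line breaks
# gives the same kind of vector
single_string <- paste(sample_page, collapse = "\n")
split_lines <- strsplit(single_string, "\r?\n")[[1]]
identical(html_lines, split_lines)

unlink(tmp)  # remove the temporary file
```

If printing htmlCode looks like one long line, checking length(htmlCode) is a quick way to see how many lines were actually read.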

Answered by Marius Butuc on Sep 27 '22 at 20:09