The readLines function seems to return all the content of the source page as one line.
con = url("target_url_here")
htmlcode = readLines(con)
The readLines function appears to have concatenated all the lines of the source page into one line, so I see no way to navigate to the 15th line of the original HTML source page.
My next approach was to parse the page using the XML package together with the httr package.
library("httr")
library("XML")  # provides htmlParse() and xpathSApply()
html <- GET("target_url_here")
content2 = content(html,as="text")
parsedHtml = htmlParse(content2,asText=TRUE)
Printing parsedHtml shows that it retains the HTML format and displays all the contents as they appear in the source page. Now suppose I want to extract the title; the function
xpathSApply(parsedHtml,"//title",xmlValue)
will give the title.
But my question is: how do I navigate to any given line, say the 15th line, of the HTML? In other words, how can I treat the HTML as a vector of strings, where each element of the vector is a separate line of the HTML page or parsed HTML object?
Having a closer look at the docs for readLines(), it actually returns:
A character vector of length the number of lines read.
So in your case:
con = url("http://example.com/file_to_parse.html")
htmlCode = readLines(con)
you can easily do htmlCode[15] to access the 15th line in the original HTML source page.
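To check this without hitting a real URL, here is a minimal sketch that writes a small, made-up page to a temporary file and reads it back. The page content and the names page, tmp, html_lines, html_text and split_lines are assumptions for illustration only. The second half shows one possible way to get the same line vector from a single string, such as the text returned by content(html, as="text"), by splitting on line breaks.

```r
# Hypothetical page content, written to a temporary file so the sketch runs offline
page <- c("<html>", "<head>", "<title>Example</title>", "</head>",
          "<body>", "<p>Hello</p>", "</body>", "</html>")
tmp <- tempfile(fileext = ".html")
writeLines(page, tmp)

# readLines() returns a character vector with one element per line
html_lines <- readLines(tmp)
length(html_lines)   # 8
html_lines[3]        # "<title>Example</title>"

# If the page is held as a single string instead, split it on line breaks
# (the pattern handles both "\n" and "\r\n" line endings)
html_text <- paste(page, collapse = "\n")
split_lines <- strsplit(html_text, "\r?\n")[[1]]
identical(split_lines, html_lines)   # TRUE
```

With a real page, replacing tmp with url("target_url_here") should behave the same way, provided the server actually sends the HTML on multiple lines.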