I am quite new to R. I want to compile a 1-million-word corpus of newspaper articles, so I am trying to write a web scraper to retrieve newspaper articles from, e.g., the Guardian website: http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs.
The scraper is meant to start on one page, retrieve the article's body text, remove all tags and save it to a text file. Then it should go to the next article via the links on this page, get the article and so on until the file contains about 1 million words.
Unfortunately, I did not get very far with my scraper.
I used readLines() to get the website's source and would now like to get hold of the relevant lines in the code.
The relevant section in the Guardian uses this id to mark the body text of the article:
<div id="article-body-blocks">
<p>
<a href="http://www.guardian.co.uk/politics/boris"
title="More from guardian.co.uk on Boris Johnson">Boris Johnson</a>,
the...a different approach."
</p>
</div>
I tried to get hold of this section using various expressions with grep and lookbehind - trying to get the line after this id - but I think it does not work across multiple lines. At least I cannot get it to work.
Could anybody help out? It would be great if somebody could provide me with some code I can continue working on!
Thanks.
You will face the problem of cleaning up the scraped page if you really insist on using grep and readLines, but this can be done, of course. E.g.:
Load the page:
html <- readLines('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs')
And with the help of str_extract from the stringr package and a simple regular expression you are done:
library(stringr)
## wrap the pattern in regex(dotall = TRUE) so that '.' also matches newlines,
## as the div spans multiple lines of the source
body <- str_extract(paste(html, collapse = '\n'),
                    regex('<div id="article-body-blocks">.*</div>', dotall = TRUE))
Well, body looks ugly; you will have to clean it of <p> tags, scripts and so on. This can be done with gsub and friends (nice regular expressions). For example:
gsub('<script(.*?)script>|<span(.*?)>|<div(.*?)>|</div>|</p>|<p(.*?)>|<a(.*?)>|\n|\t', '', body)
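If you do want to stay with this approach, here is a rough sketch of how the pieces could be combined into one function that also saves the result to a text file, as the question asks. The helper name scrape_article_text and the generic tag-stripping patterns are my own illustrative assumptions, not tested against the live page; regex-based HTML cleaning stays fragile:

library(stringr)

## Sketch only: download a page, extract the article div, strip remaining tags
scrape_article_text <- function(url) {
  html <- paste(readLines(url), collapse = '\n')
  body <- str_extract(html,
                      regex('<div id="article-body-blocks">.*</div>', dotall = TRUE))
  if (is.na(body)) return(NA_character_)
  body <- gsub('<script(.*?)</script>', '', body)  # drop scripts first
  body <- gsub('<[^>]+>', '', body)                # then all remaining tags
  gsub('[ \t\n]+', ' ', body)                      # collapse whitespace
}

## e.g. save one article to a text file:
## writeLines(scrape_article_text('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs'), 'article1.txt')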
As @Andrie suggested, you should rather use some packages built for this purpose. A small demo:
library(XML)
library(RCurl)

## download the raw HTML
webpage <- getURL('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs')
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

## parse it into a DOM tree and pull out the article paragraphs via XPath
pagetree <- htmlTreeParse(webpage, useInternalNodes = TRUE, encoding = 'UTF-8')
body <- xpathSApply(pagetree, "//div[@id='article-body-blocks']/p", xmlValue)
Here body contains the clean text:
> str(body)
chr [1:33] "The deputy prime minister, Nick Clegg, has said the government's regional growth fund will provide a \"snowball effect that cre"| __truncated__ ...
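Saving that to a text file, as the question asks, and getting a rough word count is then a matter of two lines. The whitespace-split word count is a crude approximation of my own:

cat(body, file = 'corpus.txt', sep = '\n', append = TRUE)
length(unlist(strsplit(paste(body, collapse = ' '), '[[:space:]]+')))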
Update: the above as a one-liner (thanks to @Martin Morgan for the suggestion):
xpathSApply(htmlTreeParse('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs', useInternalNodes = TRUE, encoding='UTF-8'), "//div[@id='article-body-blocks']/p", xmlValue)
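To get from one article to the roughly 1-million-word corpus the question asks for, you would wrap this in a loop that also harvests the article links from each page. A rough, untested sketch with the same XML package; the href filter for politics URLs, the stopping rule and the politeness delay are my own assumptions about how you might do it:

library(XML)

## fetch one article: its paragraph text plus candidate links to follow
get_article <- function(url) {
  tree  <- htmlTreeParse(url, useInternalNodes = TRUE, encoding = 'UTF-8')
  text  <- xpathSApply(tree, "//div[@id='article-body-blocks']/p", xmlValue)
  links <- xpathSApply(tree, "//a/@href")
  free(tree)
  ## keep only links that look like further politics articles (assumed pattern)
  links <- unique(grep('^http://www\\.guardian\\.co\\.uk/politics/', links, value = TRUE))
  list(text = text, links = links)
}

corpus_words <- 0
queue <- 'http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs'
seen  <- character(0)

while (corpus_words < 1e6 && length(queue) > 0) {
  url  <- queue[1]; queue <- queue[-1]; seen <- c(seen, url)
  page <- try(get_article(url), silent = TRUE)
  if (inherits(page, 'try-error') || length(page$text) == 0) next
  cat(page$text, file = 'corpus.txt', sep = '\n', append = TRUE)
  corpus_words <- corpus_words +
    length(unlist(strsplit(paste(page$text, collapse = ' '), '[[:space:]]+')))
  queue <- setdiff(c(queue, page$links), seen)  # breadth-first, skip visited pages
  Sys.sleep(1)  # be polite to the server
}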