How to build a web scraper in R using readLines and grep?

Tags:

r

web-scraping

I am quite new to R. I want to compile a 1-million-word corpus of newspaper articles, so I am trying to write a web scraper to retrieve articles from, e.g., the Guardian website: http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs.

The scraper is meant to start on one page, retrieve the article's body text, remove all tags and save it to a text file. Then it should go to the next article via the links on this page, get the article and so on until the file contains about 1 million words.

Unfortunately, I did not get very far with my scraper.

I used readLines() to read the page's source and would now like to get hold of the relevant lines in it.

The relevant section in the Guardian uses this id to mark the body text of the article:

<div id="article-body-blocks">         
  <p>
    <a href="http://www.guardian.co.uk/politics/boris"
       title="More from guardian.co.uk on Boris Johnson">Boris Johnson</a>,
       the...a different approach."
  </p>
</div>

I tried to get hold of this section using various expressions with grep and lookbehind - trying to get the line after this id - but I think it does not work across multiple lines. At least I cannot get it to work.

Could anybody help out? It would be great if somebody could provide me with some code I can continue working on!

Thanks.


asked Oct 31 '11 18:10

Kat

1 Answer

You will have to clean up the scraped page yourself if you really insist on using grep and readLines, but it can of course be done. E.g.:

Load the page:

html <- readLines('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs')

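With the help of str_extract from the stringr package and a simple regular expression you get the article block:

library(stringr)
# dotall = TRUE lets '.' match the newlines inserted by paste(), so the match can span lines
body <- str_extract(paste(html, collapse='\n'), regex('<div id="article-body-blocks">.*</div>', dotall = TRUE))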

Well, body looks ugly: you will also have to strip the remaining HTML tags and scripts from it. This can be done with gsub and friends (nice regular expressions). For example:

gsub('<script(.*?)script>|<span(.*?)>|<div(.*?)>|</div>|</p>|<p(.*?)>|<a(.*?)>|\n|\t', '', body)
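To see why collapsing the lines matters: grep() tests each element of the readLines() result separately, so a pattern can never span two lines until they are pasted into one string. Below is a minimal offline sketch in base R. The sample HTML is made up for illustration (it is not the real Guardian page), and the non-greedy `.*?` stops at the first `</div>`, which would cut the match short if the article block contained nested divs.

```r
# Hypothetical stand-in for the lines readLines() would return
html <- c(
  '<html><body>',
  '<div id="article-body-blocks">',
  '  <p><a href="http://example.com/boris" title="x">Boris Johnson</a>, the mayor.</p>',
  '  <p>Second paragraph.</p>',
  '</div>',
  '</body></html>'
)

# grep() works line by line, so join the lines into one string first
page <- paste(html, collapse = "\n")

# (?s) with perl = TRUE makes '.' match newlines; .*? stops at the first </div>
m <- regexpr('(?s)<div id="article-body-blocks">.*?</div>', page, perl = TRUE)
block <- regmatches(page, m)

# Strip every remaining tag, then collapse runs of whitespace
text <- gsub("<[^>]+>", "", block)
text <- trimws(gsub("\\s+", " ", text))
print(text)
```

The generic `<[^>]+>` pattern removes closing tags such as `</a>` as well, but it would still leave the contents of any `<script>` element behind, so on a real page those need to be removed first.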

As @Andrie suggested, you should rather use packages built for this purpose. Small demo:

library(XML)
library(RCurl)
webpage <- getURL('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs')
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, useInternalNodes = TRUE, encoding='UTF-8')
body <- xpathSApply(pagetree, "//div[@id='article-body-blocks']/p", xmlValue)

Here body is a character vector of clean text, one element per paragraph:

> str(body)
 chr [1:33] "The deputy prime minister, Nick Clegg, has said the government's regional growth fund will provide a \"snowball effect that cre"| __truncated__ ...

Update: The above as a one-liner (thanks to @Martin Morgan for the suggestion):

xpathSApply(htmlTreeParse('http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs', useInternalNodes = TRUE, encoding='UTF-8'), "//div[@id='article-body-blocks']/p", xmlValue)
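The question also asks about saving the text to a file and following links until roughly a million words are collected. Here is a rough sketch of such a loop built on the same XPath idea, not a finished crawler. The helper names (parse_article, count_words, crawl), the `fetch` argument and the `max_pages` safety limit are all my own assumptions for illustration. Only absolute http(s) links are followed, and nothing here checks whether a link really points to an article, so on a real site you would add your own filter and some politeness (delays, robots.txt).

```r
library(XML)

# Extract paragraph text and absolute links from one page's HTML (a single string)
parse_article <- function(html_text) {
  doc <- htmlParse(html_text, asText = TRUE)
  paras <- xpathSApply(doc, "//div[@id='article-body-blocks']/p", xmlValue)
  links <- as.character(unlist(xpathSApply(doc, "//a/@href")))
  list(text  = as.character(unlist(paras)),
       links = unique(links[grepl("^https?://", links)]))
}

# Count whitespace-separated tokens over a character vector
count_words <- function(x) sum(lengths(regmatches(x, gregexpr("\\S+", x))))

# Breadth-first crawl; `fetch` is any function that maps a URL to HTML text
crawl <- function(start, fetch, outfile, target_words = 1e6, max_pages = 1000) {
  queue <- start
  seen  <- character(0)
  total <- 0
  con <- file(outfile, open = "w", encoding = "UTF-8")
  on.exit(close(con))
  while (length(queue) > 0 && total < target_words && length(seen) < max_pages) {
    url   <- queue[1]
    queue <- queue[-1]
    if (url %in% seen) next
    seen <- c(seen, url)
    # Skip pages that fail to download instead of aborting the whole run
    html_text <- tryCatch(fetch(url), error = function(e) NULL)
    if (is.null(html_text)) next
    art <- parse_article(html_text)
    if (length(art$text) > 0) {
      writeLines(art$text, con)
      total <- total + count_words(art$text)
    }
    queue <- c(queue, setdiff(art$links, c(seen, queue)))
  }
  total
}

# One possible fetch function for real pages (not run here):
fetch_url <- function(url) paste(readLines(url, warn = FALSE), collapse = "\n")
# crawl("http://www.guardian.co.uk/politics/2011/oct/31/nick-clegg-investment-new-jobs",
#       fetch_url, "corpus.txt")
```

Passing `fetch` in as an argument keeps the loop testable without network access: you can hand it a function that looks pages up in a list before pointing it at a live site.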

answered Oct 06 '22 00:10

daroczig
