How to scrape messages from web-based forums with rvest

Question

Take a vBulletin site like the one in the example. I want to scrape just the message text from the threads. However, the CSS selectors for the messages are called #post_message_xxx, where xxx is a variable ID number.

How can I partially match the selector with html_nodes so I get all the ones that start with #post_message regardless of how they end?

Or maybe I should ask a more general question: how should I go about scraping the page if I want to attribute authors to the messages and keep track of message order?

Thanks.

library(rvest)
# read_html() replaces html(), which is deprecated in newer rvest releases
html <- read_html("http://www.acme.com/forums/new_rules_28429/")
cast <- html_nodes(html, "#post_message_28429")
cast

<div id="post_message_28429">&#13;            &#13;           Thanks for posting this.&#13;        </div>

attr(,"class")
[1] "XMLNodeSet"

MrFlick · Accepted Answer

Rather than using a CSS selector, use an XPath selector. XPath has a starts-with() function that matches on the beginning of an attribute value:

cast <- html_nodes(html, xpath="//div[starts-with(@id,'post_message')]")
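
To make the prefix match concrete and touch on the second half of the question (authors and message order), here is a minimal sketch. It runs against a small inline page rather than a real forum: the markup, the "post" wrapper and the "username" class are all assumptions made up for illustration, so the author lookup will need adjusting to whatever the real vBulletin template uses. It also shows that a CSS attribute-prefix selector ([id^='...']) should select the same nodes as the XPath version.

```r
library(rvest)

# Hypothetical, simplified forum markup. A real vBulletin page nests posts
# differently; the "post" and "username" classes are assumptions.
page <- read_html('
<html><body>
  <div class="post">
    <a class="username">alice</a>
    <div id="post_message_28429"> Thanks for posting this. </div>
  </div>
  <div class="post">
    <a class="username">bob</a>
    <div id="post_message_28430"> You are welcome. </div>
  </div>
  <div id="footer">Not a post</div>
</body></html>')

# XPath prefix match, as in the answer above.
by_xpath <- html_nodes(page, xpath = "//div[starts-with(@id, 'post_message')]")

# CSS alternative: [attr^=value] matches attributes that begin with value.
by_css <- html_nodes(page, "div[id^='post_message']")

# For each message node, look up the author link inside the same parent.
# The relative XPath "../a[...]" depends on the assumed markup above.
authors <- html_node(by_xpath, xpath = "../a[@class='username']")

# Nodes are returned in document order, so row order is message order.
posts <- data.frame(
  position = seq_along(by_xpath),
  id       = html_attr(by_xpath, "id"),
  author   = html_text(authors),
  text     = trimws(html_text(by_xpath)),
  stringsAsFactors = FALSE
)

print(posts)
```

Keeping the id column alongside the position makes it possible to join messages from several pages of a thread later without relying on row order alone.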


Tags: r, rvest, variable
