Take a vBulletin site like the one in the example below. I want to scrape just the message text from the threads. However, the CSS selectors for the messages have the form #post_message_xxx, where xxx is a variable ID number.
How can I partially match the selector with html_nodes so that I get every node whose id starts with post_message, regardless of how it ends?
Or maybe I should ask a more general question: how should I go about scraping the page if I want to attribute authors to the messages and keep track of message order?
Thanks.
library(rvest)
html <- html("http://www.acme.com/forums/new_rules_28429/")
cast <- html_nodes(html, "#post_message_28429")
cast
<div id="post_message_28429"> Thanks for posting
this. </div>

attr(,"class")
[1] "XMLNodeSet"
Rather than using a CSS selector, use an XPath selector, which has a starts-with() function:
cast <- html_nodes(html, xpath="//div[starts-with(@id,'post_message')]")
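To try the prefix match without hitting a live forum, here is a minimal sketch against a small inline page. The markup is hypothetical: the `div.post` wrapper and the `a.username` element are assumptions made for illustration, so inspect the real vBulletin HTML and adjust those selectors. It also shows that a CSS attribute selector with `^=` should give the same prefix match, and one possible way to keep authors and message order aligned.

```r
library(rvest)

# Hypothetical stand-in for a thread page. The div.post wrapper and the
# a.username element are assumptions, not real vBulletin markup.
page <- read_html('<html><body>
  <div class="post">
    <a class="username">alice</a>
    <div id="post_message_28429">Thanks for posting this.</div>
  </div>
  <div class="post">
    <a class="username">bob</a>
    <div id="post_message_28430">You are welcome.</div>
  </div>
  <div id="sidebar">Not a post</div>
</body></html>')

# XPath prefix match on the id attribute
msgs_xpath <- html_nodes(page, xpath = "//div[starts-with(@id, 'post_message')]")

# CSS alternative: ^= matches attribute values that begin with the given string
msgs_css <- html_nodes(page, "div[id^='post_message']")

# Work from each post container so author and message stay paired.
# html_node() (singular) returns exactly one result per container, and
# containers come back in document order, so the index is the post order.
containers <- html_nodes(page, "div.post")
posts <- data.frame(
  order   = seq_along(containers),
  author  = html_text(html_node(containers, "a.username")),
  message = trimws(html_text(html_node(containers, "div[id^='post_message']"))),
  stringsAsFactors = FALSE
)
print(posts)
```

Selecting the author and the message relative to each container, rather than scraping two independent lists, avoids misalignment when a post lacks one of the elements.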