To be precise, I have a class, say A, that I select via html_nodes() in rvest. A can contain many nested classes and lots of HTML tags, such as links and img tags. I want to drop some particular classes and tags from A while scraping the rest of the data. I do not know the classes for the rest of the data, but I do know what I want to blacklist.
The HTML (hypothetical) is below. The tag <div class="messageContent"> is repeated up to 25 times in the document, with differing content but the same structure.
<div class="messageContent">
<article>
<blockquote class="messageText SelectQuoteContainer ugc baseHtml">
<div class="bbCodeBlock bbCodeQuote" data-author="Generic">
<aside>
<div class="attribution type">Generic said:
<a href="goto/post?id=32554#post-32754" class="AttributionLink">↑</a>
</div>
<blockquote class="quoteContainer"><div class="quote">I see what you did there.</div><div class="quoteExpand">Click to expand...</div></blockquote>
</aside>
</div><img src="styles/default/xenforo/clear.png" class="mceSmilieSprite mceSmilie9" alt=":o" title="Eek! :o"/> Really?
<aside>
<div class="attribution type">Generic said:
<a href="goto/post?id=32554#post-32754" class="AttributionLink">↑</a>
</div>
<blockquote class="quoteContainer"><div class="quote">I see what you did there.</div><div class="quoteExpand">Click to expand...</div></blockquote>
</aside>
<div class="messageTextEndMarker"> </div>
</blockquote>
</article>
</div>
So, the page I'm scraping contains multiple such elements. I do:
posts <- page %>% html_nodes(".messageContent")
This gives me a list of 25 html nodes, each containing variations of the aforementioned html content.
I want to remove everything between the <aside> and </aside> tags (which can occur in multiple places in a post) and capture the rest of the HTML via html_text() %>% as.character().
Can I do this with rvest?
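Testing out @Mirosław Zalewski's solution:
test <- page %>% html_node(".messageContent") %>%
  html_nodes(xpath='//*[not(ancestor::aside or name()="aside")]/text()')
This returned all of the text on the page that was not within an aside, not just the text of the first post: an XPath that starts with // is evaluated from the document root, regardless of the node it is called on. A bit of fine-tuning led me to: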
page %>% html_nodes(xpath='(//div[@class="messageContent"])[1]//*[not(ancestor::aside or name()="aside")]/text()') %>% html_text() %>% as.character()
Iterated over the 25 posts, this gives me exactly what I need.
Using XPath, you can select all nodes that are not <aside> or descendants of <aside>:
# The leading "." keeps the query relative to the selected post;
# a bare "//" would search the whole document.
page %>% html_node(".messageContent") %>%
  html_nodes(xpath='.//*[not(ancestor::aside or name()="aside")]')
Unfortunately, this will also match elements that contain <aside>. If you pass those to html_text(), it will return the <aside> text content anyway.
This can be overcome by adding another condition to the query. One good candidate is "everything that is a text node":
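To make that concrete, here is a minimal sketch on a made-up, one-post snippet (the HTML string and variable names are hypothetical, and the leading "." in the XPath is my addition to keep the query relative to the post). The only element that matches is the <blockquote>, and its text still includes the quoted text from the <aside> inside it:

```r
library(rvest)

# Hypothetical, cut-down version of one post.
html <- '<div class="messageContent"><blockquote><aside>quoted</aside> Really?</blockquote></div>'
post <- html_node(read_html(html), ".messageContent")

# Elements that are neither <aside> nor inside one.
matched <- html_nodes(post, xpath = './/*[not(ancestor::aside or name()="aside")]')

matched_names <- html_name(matched)  # which elements matched
matched_text  <- html_text(matched)  # text of those elements, descendants included
```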
# The leading "." keeps the query relative to the selected post.
page %>% html_node(".messageContent") %>%
  html_nodes(xpath='.//*[not(ancestor::aside or name()="aside")]/text()')
Since /text() returns only text nodes, html_text() has no markup left to strip; it merely converts the nodes to a character vector. Many of these text nodes contain only whitespace, though, so it is worth calling html_text() with its trim argument set (html_text(trim = TRUE)).
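Putting the pieces together for every post, a sketch could look like the following. The two-post HTML string is a hypothetical reduction of the structure in the question, the variable names are mine, and the leading "." in the XPath is an assumption on my part to keep each query inside its own post:

```r
library(rvest)

# Hypothetical two-post document, reduced from the structure in the question.
html <- '<html><body>
<div class="messageContent"><article><blockquote class="messageText">
<aside><div class="quote">quoted one</div></aside> Really?
</blockquote></article></div>
<div class="messageContent"><article><blockquote class="messageText">
Second post <aside><div class="quote">quoted two</div></aside>
</blockquote></article></div>
</body></html>'

page  <- read_html(html)
posts <- html_nodes(page, ".messageContent")

# Text nodes of elements that are neither <aside> nor inside one,
# searched relative to the current post.
query <- './/*[not(ancestor::aside or name()="aside")]/text()'

post_text <- vapply(seq_along(posts), function(i) {
  pieces <- html_text(html_nodes(posts[[i]], xpath = query), trim = TRUE)
  # Drop the pieces that were whitespace-only before trimming.
  paste(pieces[nzchar(pieces)], collapse = " ")
}, character(1))
```

post_text then holds one string per post, with the quoted <aside> content left out.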
Please note that this solution will also skip any non-text content, such as image references (probably including emotes). Your original proposal would do that as well, but it is unclear to me whether you had intended that or not.
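If you do want to keep the emotes, one option is to collect them separately. This is only a sketch on a made-up snippet (the HTML string, file names, and variable names are hypothetical): it selects the <img> elements that are not inside an <aside> and reads their alt text with html_attr():

```r
library(rvest)

# Hypothetical post with one emote inside a quote and one outside it.
html <- '<div class="messageContent"><blockquote>
<aside><img src="quoted.png" alt="quoted emote"/></aside>
<img src="clear.png" class="mceSmilieSprite" alt=":o"/> Really?
</blockquote></div>'

post <- html_node(read_html(html), ".messageContent")

# Images that are not inside an <aside>, relative to this post.
imgs <- html_nodes(post, xpath = './/img[not(ancestor::aside)]')

# The alt attribute carries the textual form of the emote.
emote_alt <- html_attr(imgs, "alt")
```

How you merge these back into the text is up to you; this sketch does not preserve their position relative to the surrounding text nodes.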