
How to use ScrapySharp to parse elements in an html document?

Here's the project's official "documentation":

https://bitbucket.org/rflechner/scrapysharp/wiki/Home


No matter what I try, I can't find the CssSelect() method that the library is supposed to add to make querying things easier. Here's what I've tried:

using ScrapySharp.Core;
using ScrapySharp.Html.Parsing;
using HtmlAgilityPack;

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.stackoverflow.com");

var page = doc.DocumentNode.SelectSingleNode("//body");
page.CssSel???

Exactly how do I use this library? In the documentation it isn't clear what type html is.

sergserg asked Mar 31 '13 01:03




1 Answer

Add

using ScrapySharp.Extensions;

It looks like you're missing that. That should make CssSelect available.

Just in case an example helps, here's a method, as well, that I use in a project:

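// Requires: using System; using System.Linq; using HtmlAgilityPack; using ScrapySharp.Extensions;
private string GetPdfUrl(HtmlDocument document, string baseUrl)
{
    // Single() throws unless exactly one node matches the selector.
    var link = document.DocumentNode.CssSelect(".table-of-content .head-row td.download a.text-pdf").Single();
    // Resolve the (possibly relative) href against the base URL.
    return new Uri(new Uri(baseUrl), link.Attributes["href"].Value).ToString();
}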
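
For completeness, here is a minimal, self-contained sketch of the same idea applied to a whole document. It assumes the HtmlAgilityPack and ScrapySharp NuGet packages are installed. The inline HTML, the class name CssSelectDemo, and the selector "div.nav a" are made up for illustration, so swap in your own page and selector.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;
using ScrapySharp.Extensions; // brings the CssSelect extension methods into scope

class CssSelectDemo
{
    static void Main()
    {
        // Hypothetical sample markup, loaded from a string so no network access is needed.
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><div class='nav'>" +
                     "<a href='/questions'>Questions</a>" +
                     "<a href='/tags'>Tags</a>" +
                     "</div></body></html>");

        // CssSelect is an extension method on HtmlNode; it returns the matching nodes.
        var links = doc.DocumentNode.CssSelect("div.nav a").ToList();

        foreach (var link in links)
        {
            // GetAttributeValue returns the fallback ("") when the attribute is absent.
            Console.WriteLine(link.InnerText + " -> " + link.GetAttributeValue("href", ""));
        }
    }
}
```

With a live page you would get the document from web.Load(...) as in the question and call CssSelect on doc.DocumentNode, or on any other HtmlNode, in the same way.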
Ben Allred answered Sep 30 '22 17:09