Here's the project's official "documentation":
https://bitbucket.org/rflechner/scrapysharp/wiki/Home
No matter what I try, I can't find the CssSelect() method that the library is supposed to add to make querying things easier. Here's what I've tried:
using ScrapySharp.Core;
using ScrapySharp.Html.Parsing;
using HtmlAgilityPack;
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.stackoverflow.com");
var page = doc.DocumentNode.SelectSingleNode("//body");
page.CssSel???
Exactly how do I use this library? In the documentation it isn't clear what type html is.
As an example, go to the NYTimes website and right-click on the page. Select View page source or simply press Ctrl + U on your keyboard. A new page opens containing a number of links, HTML tags, and content. This is the source from which the HTML parser scrapes content for NYTimes!
ScrapySharp is an open-source web scraping library designed for C#. It includes a web client that simulates a browser’s behavior (intended for scraping dynamic pages or event-triggered content) and an HtmlAgilityPack extension for selecting elements using CSS selectors.
We can pick elements from the page in our scraper by defining the element + class (‘a.className’) or element + ID (‘a#idName’). An alternative to CSS selectors is the XPath of the element. XML Path Language (XPath) uses path expressions to select nodes from an XML or HTML document.
We can also target the href attribute to get the URL; this is especially important for storing the data source or following pagination. To look at the HTML structure of a website, press Ctrl/Command + Shift + C (or right-click and choose Inspect) on the page you want to scrape. This opens the Inspector in the browser’s Developer Tools.
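To make the selector options above concrete, here is a minimal sketch that picks the same kind of link by class, by ID, and by XPath, then reads its href. The inline markup, the class name, and the ID are hypothetical and only serve the illustration; the sketch assumes the HtmlAgilityPack and ScrapySharp NuGet packages are referenced.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;
using ScrapySharp.Extensions; // brings the CssSelect() extension method into scope

class SelectorDemo
{
    static void Main()
    {
        // Hypothetical markup, parsed from a string so no network access is needed.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<body>" +
            "<a class='className' href='/next'>Next</a>" +
            "<a id='idName' href='/home'>Home</a>" +
            "</body>");

        // element + class, as a CSS selector
        HtmlNode byClass = doc.DocumentNode.CssSelect("a.className").First();

        // element + ID, as a CSS selector
        HtmlNode byId = doc.DocumentNode.CssSelect("a#idName").First();

        // the class lookup again, this time as an XPath expression
        HtmlNode byXPath = doc.DocumentNode.SelectSingleNode("//a[@class='className']");

        // Read the href attribute; the second argument is the fallback if it is missing.
        Console.WriteLine(byClass.GetAttributeValue("href", ""));
        Console.WriteLine(byId.GetAttributeValue("href", ""));
        Console.WriteLine(byXPath.GetAttributeValue("href", ""));
    }
}
```

CssSelect() returns a sequence of nodes, so First() or Single() is needed to get one element, whereas SelectSingleNode() returns a single node (or null when nothing matches).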
Add
using ScrapySharp.Extensions;
It looks like you're missing that. That should make CssSelect available.
Just in case an example helps, here's a method, as well, that I use in a project:
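// Needs: using System; using System.Linq; using HtmlAgilityPack; using ScrapySharp.Extensions;
private string GetPdfUrl(HtmlDocument document, string baseUrl)
{
    // Single() throws unless exactly one element matches the selector.
    var href = document.DocumentNode.CssSelect(".table-of-content .head-row td.download a.text-pdf").Single().Attributes["href"].Value;
    // Resolve the (possibly relative) href against the base URL.
    return new Uri(new Uri(baseUrl), href).ToString();
}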
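For completeness, here is a sketch of the snippet from the question with the missing using directive added. It assumes the HtmlAgilityPack and ScrapySharp packages are installed and that the URL is reachable; the "a" selector is just an illustrative choice, not something taken from the documentation.

```csharp
using System;
using HtmlAgilityPack;
using ScrapySharp.Extensions; // without this directive, CssSelect() is not found

class Program
{
    static void Main()
    {
        // Same loading code as in the question.
        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://www.stackoverflow.com");
        HtmlNode page = doc.DocumentNode.SelectSingleNode("//body");

        // CssSelect() is an extension method on HtmlNode, so it works on
        // any node, not only on the document root.
        foreach (HtmlNode link in page.CssSelect("a"))
        {
            // Print each link's href, or an empty string when it has none.
            Console.WriteLine(link.GetAttributeValue("href", ""));
        }
    }
}
```

Since CssSelect() is an extension method, it only shows up in IntelliSense once its namespace is imported, which is why it appeared to be missing.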