
C# .NET: Scraping dynamic (JS) websites

After hours of failed attempts, I am asking here. I need to scrape a dynamically generated web page (built with Vue.js; I would prefer not to share the link).

I have tried multiple approaches (1, 2, 3). None of them works on this webpage.

The most promising solution was using Selenium and PhantomJS. I tried it like this and I'm not sure why it's not even working for Google:

private void button1_Click(object sender, EventArgs e) {
        PhantomJSDriverService service = PhantomJSDriverService.CreateDefaultService();
        service.IgnoreSslErrors = true;
        service.LoadImages = false;
        service.ProxyType = "none";

        var driver = new PhantomJSDriver(service); // I also tried: new PhantomJSDriver();
        driver.Manage().Timeouts().PageLoad = TimeSpan.FromSeconds(10);
        driver.Url = "https://google.com";
        driver.Navigate();

        var source = driver.PageSource;
        textBox1.AppendText(source);
}

It did not work.

I also tried a WebBrowser control, but the page never fully loads.

(EDIT: I found out that the WebBrowser control just hosts Internet Explorer. When I open the target website in standalone IE, the page also never loads completely, so it makes sense to see the same behaviour inside the WebBrowser control. Because of this, I think I am bound to Selenium and PhantomJS.)

Surely this shouldn't be so complicated. How do I do it properly?

asked Nov 08 '22 05:11 by c0dehunter


1 Answer

If you need to scrape a website, you can use the ScrapySharp scraping framework. You can add it to a project as a NuGet package: https://www.nuget.org/packages/ScrapySharp/

Install-Package ScrapySharp -Version 2.6.2

It has many useful properties for accessing different elements on the page. For example, to access the entire HTML of the page, you can use the following:

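        // Requires: using System; using HtmlAgilityPack; using ScrapySharp.Network;
        ScrapingBrowser browser = new ScrapingBrowser();
        // Fetch the page over HTTP and parse the returned markup.
        WebPage pageResult = browser.NavigateToPage(new Uri("http://www.example-site.com"));
        HtmlNode rawHtml = pageResult.Html;
        Console.WriteLine(rawHtml.InnerHtml);
        Console.ReadLine(); // keep the console window open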
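
To illustrate accessing individual elements rather than the whole document, here is a minimal sketch that uses the `CssSelect` extension method from `ScrapySharp.Extensions`. The inline HTML string and the `a.item` selector are made up for the example and stand in for whatever the target page returns; the program assumes the ScrapySharp package (which brings in HtmlAgilityPack) is installed.

```csharp
using System;
using HtmlAgilityPack;
using ScrapySharp.Extensions;

class Program
{
    static void Main()
    {
        // Hypothetical markup standing in for a downloaded page.
        const string html =
            "<ul>" +
            "<li><a class='item' href='/a'>First</a></li>" +
            "<li><a class='item' href='/b'>Second</a></li>" +
            "</ul>";

        // Parse the string into an HtmlAgilityPack document.
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // CssSelect takes a CSS selector and returns the matching nodes.
        foreach (HtmlNode link in document.DocumentNode.CssSelect("a.item"))
        {
            // Print the link text and its href attribute ("" if missing).
            Console.WriteLine($"{link.InnerText} -> {link.GetAttributeValue("href", "")}");
        }
    }
}
```

The same `CssSelect` call can be applied to `PageResult.Html` from the snippet above. As far as I know, ScrapySharp only parses the HTML that the server returns and does not execute JavaScript, so elements rendered client-side (for example by Vue.js) may not be present in that markup.
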

answered Nov 12 '22 20:11 by ashish