I have a web browser and a label in Visual Studio, and basically what I'm trying to do is grab a section from another webpage.
I tried using WebClient.DownloadString and WebClient.DownloadFile, and both of them give me the source code of the web page before the JavaScript loads the content. My next idea was to use a WebBrowser control and just call webBrowser.DocumentText after the page loaded, but that did not work; it still gives me the original source of the page.
Is there a way I can grab the page after the JavaScript has loaded?
Scraping JavaScript-rendered pages is a useful technique for extracting data from the Internet for presentation or analysis.
C#, and .NET in general, have the tools and libraries you need to implement your own data scraper; with tools like Puppeteer and Selenium in particular, you can quickly put together a crawler project and get the data you want.
There are two approaches to scraping a dynamic webpage: scrape the content directly from the JavaScript, or scrape the website as we view it in our browser, using packages capable of executing the JavaScript.
The problem is that the browser usually executes the JavaScript, which results in an updated DOM. Unless you can analyze the JavaScript or intercept the data it uses, you will need to execute the code as a browser would. I ran into the same issue in the past and used Selenium and PhantomJS to render the page. After the page renders, I use the WebDriver client to navigate the DOM and retrieve the content I need, post-AJAX.
At a high level, these are the steps:
Install-Package Selenium.WebDriver
Here is an example usage of the PhantomJS WebDriver:
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;
using OpenQA.Selenium.Remote;

var options = new PhantomJSOptions();
options.AddAdditionalCapability("IsJavaScriptEnabled", true);
var driver = new RemoteWebDriver(
    new Uri(Configuration.SeleniumServerHub),
    options.ToCapabilities(),
    TimeSpan.FromSeconds(3));
// setting Url loads the page; the driver executes the script while loading it
driver.Url = "http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083";
// the driver can now provide you with what you need
// get the source of the page
var source = driver.PageSource;
// or navigate the DOM directly
var pathElement = driver.FindElement(By.Id("some-id"));
More info on Selenium, PhantomJS and WebDriver can be found at the following links:
http://docs.seleniumhq.org/
http://docs.seleniumhq.org/projects/webdriver/
http://phantomjs.org/
EDIT: Easier Method
It appears there is a NuGet package for PhantomJS, so you don't need the hub (I used a cluster to do massive scraping in this manner):
Install web driver:
Install-Package Selenium.WebDriver
Install embedded exe:
Install-Package phantomjs.exe
Updated code:
using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;

var driver = new PhantomJSDriver();
// setting Url loads the page; the driver executes the script while loading it
driver.Url = "http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083";
// get the source of the page
var source = driver.PageSource;
// or navigate the DOM directly
var pathElement = driver.FindElement(By.Id("some-id"));
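One caveat with the snippets above: they read the page right after navigating, and content filled in by AJAX may not be in the DOM yet at that moment. Selenium ships an explicit wait helper for this (WebDriverWait in the Selenium.Support package). To keep the example free of dependencies, here is a minimal sketch of the same polling idea in plain C#. The `Poll.Until` name and its parameters are hypothetical and not part of Selenium.

```csharp
using System;
using System.Threading;

public static class Poll
{
    // Hypothetical helper: calls probe repeatedly until it returns a non-null
    // value or the timeout elapses. An exception thrown by probe before the
    // deadline is treated as "not ready yet".
    public static T Until<T>(Func<T> probe, TimeSpan timeout, TimeSpan interval)
        where T : class
    {
        var deadline = DateTime.UtcNow + timeout;
        while (true)
        {
            try
            {
                var result = probe();
                if (result != null) return result;
            }
            catch (Exception) when (DateTime.UtcNow < deadline)
            {
                // Not ready yet; fall through and retry.
            }
            if (DateTime.UtcNow >= deadline)
                throw new TimeoutException("Condition not met within " + timeout);
            Thread.Sleep(interval);
        }
    }
}
```

With a driver in hand, usage would look roughly like `var element = Poll.Until(() => driver.FindElement(By.Id("some-id")), TimeSpan.FromSeconds(10), TimeSpan.FromMilliseconds(250));`, after which `driver.PageSource` should include the rendered element.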
Thanks to wbennet, I discovered PhantomJSCloud.com. It offers enough free service to scrape pages through web API calls.
public static string GetPagePhantomJs(string url)
{
    using (var client = new System.Net.Http.HttpClient())
    {
        client.DefaultRequestHeaders.ExpectContinue = false;
        // request body: ask the service to render the page and return plain HTML
        var pageRequestJson = new System.Net.Http.StringContent(
            @"{'url':'" + url + "','renderType':'html','outputAsJson':false }");
        // replace {YOUR_API_KEY} with your own key
        var response = client.PostAsync(
            "https://PhantomJsCloud.com/api/browser/v2/{YOUR_API_KEY}/",
            pageRequestJson).Result;
        // .Result blocks the calling thread until the request completes
        return response.Content.ReadAsStringAsync().Result;
    }
}
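The request body above is built by string concatenation, so a URL containing a quote character would break the payload. Here is a minimal sketch of building the same three fields with a serializer instead, assuming System.Text.Json is available (.NET Core 3.0 or later, or the NuGet package). `PhantomRequest.BuildPageRequestJson` is a hypothetical helper name; the field names are taken from the snippet above.

```csharp
using System.Collections.Generic;
using System.Text.Json;

public static class PhantomRequest
{
    // Hypothetical helper: serializes the url/renderType/outputAsJson fields
    // used in the answer above, letting the serializer handle escaping.
    public static string BuildPageRequestJson(string url)
    {
        var payload = new Dictionary<string, object>
        {
            ["url"] = url,
            ["renderType"] = "html",
            ["outputAsJson"] = false
        };
        return JsonSerializer.Serialize(payload);
    }
}
```

The returned string can be passed to `new System.Net.Http.StringContent(...)` in place of the concatenated literal. This produces standard double-quoted JSON rather than the single-quoted form in the original.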
OK, I will show you how to enable JavaScript using PhantomJS and Selenium with C#.
In your main function, type this code:
var options = new PhantomJSOptions();
options.AddAdditionalCapability("IsJavaScriptEnabled", true);
// first argument: the folder that contains phantomjs.exe
IWebDriver driver = new PhantomJSDriver("phantomjs Folder Path", options);
driver.Navigate().GoToUrl("https://www.yourwebsite.com/");
try
{
    string pagesource = driver.PageSource;
    // throws if the element is not present in the rendered page
    driver.FindElement(By.Id("yourelement"));
    Console.Write("yourelement found");
}
catch (Exception e)
{
    Console.WriteLine(e.Message);
}
Console.Read();
Don't forget to fill in your website, the element you are looking for, and the phantomjs.exe path on your machine in the code above.
Have a great time coding, and thanks wbennett.