Would it be possible to write a screen-scraper for a website protected by a form login. I have access to the site, of course, but I have no idea how to login to the site and save my credentials in C#.
Also, any good examples of screenscrapers in C# would be hugely appreciated.
Has this already been done?
Yes, it's login screens. Sometimes, you might set your sights on scraping data you can access only after you log into an account. It could be your channel analytics, your user history, or any other type of information you need. In this case, first check if the company provides an API for the purpose.
As you saw in this tutorial, C++, which is normally used for system programming, also works well for web scraping because of its ability to parse HTTP.
Web Scraping Past Login ScreensParseHub is a free and powerful web scraper that can log in to any site before it starts scraping data. You can then set it up to extract the specific data you want and download it all to an Excel or JSON file. To get started, make sure you download and install ParseHub for free.
It's pretty simple. You need your custom login (HttpPost) method.
You can come up with something like this (in this way you will get all needed cookies after login, and you need just to pass them to the next HttpWebRequest):
public static HttpWebResponse HttpPost(String url, String referer, String userAgent, ref CookieCollection cookies, String postData, out WebHeaderCollection headers, WebProxy proxy)
{
try
{
HttpWebRequest http = WebRequest.Create(url) as HttpWebRequest;
http.Proxy = proxy;
http.AllowAutoRedirect = true;
http.Method = "POST";
http.ContentType = "application/x-www-form-urlencoded";
http.UserAgent = userAgent;
http.CookieContainer = new CookieContainer();
http.CookieContainer.Add(cookies);
http.Referer = referer;
byte[] dataBytes = UTF8Encoding.UTF8.GetBytes(postData);
http.ContentLength = dataBytes.Length;
using (Stream postStream = http.GetRequestStream())
{
postStream.Write(dataBytes, 0, dataBytes.Length);
}
HttpWebResponse httpResponse = http.GetResponse() as HttpWebResponse;
headers = http.Headers;
cookies.Add(httpResponse.Cookies);
return httpResponse;
}
catch { }
headers = null;
return null;
}
Sure, this has been done. I have done it a couple of times. This is (generically) called Screen-scraping or Web Scraping.
You should take a look at this question (and also browse the questions under the tag "screen-scraping". Note that Scraping does not only relate to data extraction from a web resource. It also involves submission of data to online forms so as mimic the actions of a user when submitting input such as a Login form.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With