Saving a Web Page with all its content in C#

I am trying to save a web page, the way browsers do, along with all its content and formatting. I tried the WebClient and WebRequest examples, but they only download the text and sometimes the JavaScript, not the CSS, images, and so on. Is there an API for this in .NET, or a third-party API for .NET?

I think it is possible, because many offline-reading applications show saved pages with the same formatting and styling. How is it done? Any ideas?

EDIT 1: Web pages can be parsed and saved using HtmlAgilityPack. But is there any way to separate the main article from other content such as ads and external links? Is there any way to tell which content is relevant and which is not? (I am sorry if this question is not clear.)

Also, can anyone suggest how offline-reading applications (read-later apps, Pocket, etc.) save a web page and format it?

Is there any way to do the same in C#?

asked Dec 09 '25 21:12 by Deeps


2 Answers

You can download the page's HTML as text, parse it to find elements such as <link rel="stylesheet" type="text/css" href="..."> or <img src="..."/>, and then download the URLs in their href or src attributes separately.

HtmlAgilityPack is a reliable and useful library for parsing HTML.
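The approach above could be sketched roughly as follows. This is a minimal illustration, not a complete page saver: it assumes the HtmlAgilityPack NuGet package is installed, the names PageSaver, FindAssetUrls and SaveAsync are hypothetical, and it does not rewrite the links in the saved HTML or follow url(...) references inside CSS files.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack; // third-party package, assumed to be installed via NuGet

// Hypothetical helper class, for illustration only.
public static class PageSaver
{
    // Returns absolute http(s) URLs of the stylesheets and images the HTML references.
    public static List<Uri> FindAssetUrls(string html, Uri pageUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var found = new List<Uri>();
        Collect(doc, "//link[@rel='stylesheet'][@href]", "href", pageUri, found);
        Collect(doc, "//img[@src]", "src", pageUri, found);
        return found;
    }

    private static void Collect(HtmlDocument doc, string xpath, string attribute,
                                Uri pageUri, List<Uri> found)
    {
        // SelectNodes returns null (not an empty list) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null) return;

        foreach (HtmlNode node in nodes)
        {
            string raw = node.GetAttributeValue(attribute, "");
            // Resolve relative URLs against the page address; skip data: URIs and the like.
            if (Uri.TryCreate(pageUri, raw, out Uri absolute) &&
                (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps))
            {
                found.Add(absolute);
            }
        }
    }

    // Saves the page as index.html and each asset as asset0.ext, asset1.ext, ...
    public static async Task SaveAsync(Uri pageUri, string outputDir)
    {
        Directory.CreateDirectory(outputDir);
        using (var http = new HttpClient())
        {
            string html = await http.GetStringAsync(pageUri);
            File.WriteAllText(Path.Combine(outputDir, "index.html"), html);

            int index = 0;
            foreach (Uri asset in FindAssetUrls(html, pageUri))
            {
                byte[] bytes = await http.GetByteArrayAsync(asset);
                string extension = Path.GetExtension(asset.AbsolutePath);
                File.WriteAllBytes(Path.Combine(outputDir, "asset" + index++ + extension), bytes);
            }
        }
    }
}
```

A call such as await PageSaver.SaveAsync(new Uri("https://example.com/"), "saved-page") would write the files into the saved-page folder (the URL and folder name are placeholders). For the saved copy to render offline, the href and src attributes in index.html would still need to be rewritten to point at the local files.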

answered Dec 12 '25 11:12 by Ria


You can use Wget; see the section of its manual on recursive downloads:

https://www.gnu.org/software/wget/manual/html_node/Recursive-Download.html#Recursive-Download

answered Dec 12 '25 11:12 by x2.


