C# - Best Approach to Parsing Webpage?

I've saved an entire webpage's HTML to a string, and now I want to grab the "href" values from the links, preferably in a way that lets me store each one in its own string later. What's the best way to do this?

I've tried saving the string as an .xml doc and parsing it using an XPathDocument navigator, but (surprise surprise) it doesn't navigate a not-really-an-xml-document too well.

Are regular expressions the best way to achieve what I'm trying to accomplish?

asked Nov 18 '08 21:11

MattSayar

2 Answers

I can recommend the HTML Agility Pack. I've used it in a few cases where I needed to parse HTML and it works great. Once you load your HTML into it, you can use XPath expressions to query the document and get your anchor tags (as well as just about anything else in there).

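var yourDoc = new HtmlAgilityPack.HtmlDocument();
yourDoc.LoadHtml(html); // html: the string that holds the page source
// SelectNodes returns null when nothing matches, so guard before reading Count.
int someCount = yourDoc.DocumentNode.SelectNodes("your_xpath")?.Count ?? 0;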
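
For the specific case in the question (collecting every href value), a minimal sketch might look like the following. It assumes the HtmlAgilityPack NuGet package is referenced; the class name `LinkExtractor` and the method name `GetHrefs` are hypothetical names chosen for illustration, and `//a[@href]` is one possible XPath for "anchors that have an href attribute".

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package: HtmlAgilityPack

public static class LinkExtractor
{
    // Hypothetical helper: returns the href value of every <a> element that has one.
    public static List<string> GetHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of markup that is not well-formed XML

        var hrefs = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return hrefs;
        }

        foreach (HtmlNode anchor in anchors)
        {
            // The second argument is the fallback used if the attribute is missing.
            hrefs.Add(anchor.GetAttributeValue("href", string.Empty));
        }

        return hrefs;
    }
}
```

Each entry in the returned list is a separate string, so individual links can be copied into their own variables or filtered later.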

answered Sep 21 '22 02:09

Jeff Donnici

Regular expressions are one way to do it, but they can be problematic.

Most HTML pages can't be parsed using standard HTML techniques because, as you've found out, most don't validate.

You could spend the time trying to integrate HTML Tidy or a similar tool, but it would be much faster to just build the regex you need.

UPDATE

At the time of this update I've received 15 upvotes and 9 downvotes. I think that maybe people aren't reading the question or the comments on this answer. All the OP wanted to do was grab the href values. That's it. From that perspective, a simple regex is just fine. If the author had wanted to parse other items, there is no way I would recommend regex because, as I stated at the beginning, it's problematic at best.
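
As a rough illustration of the "simple regex" approach, here is one possible sketch using only `System.Text.RegularExpressions`. The pattern and the names `HrefRegex` and `GetHrefs` are assumptions made for this example, not something the answer specifies. It only handles quoted href values on `<a>` tags, and it will also match anchors that appear inside comments or script blocks, which is the kind of problem the answer warns about.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefRegex
{
    // Matches <a ... href="..."> or <a ... href='...'> and captures the value as "url".
    // Unquoted href values are deliberately not handled in this sketch.
    private static readonly Regex AnchorHref = new Regex(
        @"<a\b[^>]*?\shref\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)')",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Hypothetical helper: returns every captured href value in document order.
    public static List<string> GetHrefs(string html)
    {
        var hrefs = new List<string>();
        foreach (Match match in AnchorHref.Matches(html))
        {
            hrefs.Add(match.Groups["url"].Value);
        }
        return hrefs;
    }
}
```

The captured values are returned exactly as they appear in the markup, so relative URLs and entity-encoded characters are left for the caller to resolve.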

answered Sep 22 '22 02:09

NotMe

Tags: html, c#, xml, html-content-extraction
