C#.net Use HTMLDocument from Console?

Tags: c#, .net, console

I'm trying to use System.Windows.Forms.HtmlDocument in a console application. First, is this even possible? If so, how can I load a page from the web into it? I was trying to use WebBrowser, but it's telling me:

Unhandled Exception: System.Threading.ThreadStateException: ActiveX control '8856f961-340a-11d0-a96b-00c04fd705a2' cannot be instantiated because the current thread is not in a single-threaded apartment.

There seems to be a severe lack of tutorials on the HTMLDocument object (or Google is just turning up useless results).


Just discovered mshtml.HTMLDocument.createDocumentFromUrl, but that throws:

Unhandled Exception: System.Runtime.InteropServices.COMException (0x80010105): The server threw an exception. (Exception from HRESULT: 0x80010105 (RPC_E_SERVERFAULT))
   at System.RuntimeType.ForwardCallToInvokeMember(String memberName, BindingFlags flags, Object target, Int32[] aWrapperTypes, MessageData& msgData)
   at mshtml.HTMLDocumentClass.createDocumentFromUrl(String bstrUrl, String bstrOptions)
   at iget.Program.Main(String[] args)

What the heck? All I want is a list of <a> tags on a page. Why is this so hard?


For those who are curious, here's the solution I came up with, thanks to TrueWill:

using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

namespace iget
{
    class Program
    {
        static void Main(string[] args)
        {
            HtmlDocument doc = new HtmlDocument();

            // Dispose the client and the response stream once the page is parsed.
            using (WebClient wc = new WebClient())
            using (Stream stream = wc.OpenRead("http://google.com"))
            {
                doc.Load(stream);
            }

            // SelectNodes returns null (not an empty collection) when nothing matches.
            HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
            if (links == null)
            {
                return;
            }

            foreach (HtmlNode a in links)
            {
                Console.WriteLine(a.Attributes["href"].Value);
            }
        }
    }
}

asked Feb 28 '23 00:02 by mpen


2 Answers

As an alternative, you could use the free Html Agility Pack library. That can parse HTML and will let you query it with LINQ. I used an older version for a project at home and it worked great.

EDIT: You may also want to use the WebClient or WebRequest classes to download the web page. See my blog post on Web scraping in .NET. (Note that I haven't tried this in a console app.)
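
To illustrate the LINQ-style querying mentioned above, here is a minimal sketch. It assumes the HtmlAgilityPack package is referenced in the project. The names LinkExtractor and ExtractHrefs are hypothetical, chosen only for this example, and the HTML is an inline sample string so the snippet runs without network access.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Hypothetical helper: returns the href value of every <a> element that has one.
    public static List<string> ExtractHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of a stream

        return doc.DocumentNode
                  .Descendants("a")                         // all <a> elements, in document order
                  .Where(a => a.Attributes["href"] != null) // skip anchors without an href
                  .Select(a => a.Attributes["href"].Value)
                  .ToList();
    }
}

public static class Demo
{
    public static void Main()
    {
        // Sample markup (an assumption for the demo, not a real page).
        string html = "<p><a href=\"/one\">1</a> <a name=\"x\">no href</a> <a href=\"/two\">2</a></p>";

        foreach (string href in LinkExtractor.ExtractHrefs(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

Unlike SelectNodes, Descendants yields an empty sequence when nothing matches, so no null check is needed. To run this against a live page, the string passed to ExtractHrefs could come from WebClient.DownloadString.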

answered Mar 07 '23 02:03 by TrueWill


Add the [STAThread] attribute to your Main method:

    [STAThread]
    static void Main(string[] args)
    {
    }

That should fix it.
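
The attribute gets past the ThreadStateException, but in a console app the WebBrowser control also needs a message loop before it raises DocumentCompleted. Here is a rough sketch of how the pieces might fit together, assuming the project references System.Windows.Forms. The URL is the one from the question, and the ReadyState check is just one possible way to wait for the full page.

```csharp
using System;
using System.Windows.Forms;

namespace iget
{
    class Program
    {
        // WebBrowser wraps an ActiveX control, so the thread must be single-threaded apartment.
        [STAThread]
        static void Main(string[] args)
        {
            using (WebBrowser browser = new WebBrowser())
            {
                browser.ScriptErrorsSuppressed = true; // avoid script error dialogs

                browser.DocumentCompleted += (sender, e) =>
                {
                    // The event can fire once per frame; wait until the whole page is ready.
                    if (browser.ReadyState != WebBrowserReadyState.Complete)
                    {
                        return;
                    }

                    // Document is a System.Windows.Forms.HtmlDocument.
                    foreach (HtmlElement link in browser.Document.Links)
                    {
                        Console.WriteLine(link.GetAttribute("href"));
                    }

                    Application.ExitThread(); // stop the message loop started below
                };

                browser.Navigate("http://google.com");

                // Pump Windows messages so the control can load the page and raise events.
                Application.Run();
            }
        }
    }
}
```

This route only works on Windows and goes through the Internet Explorer engine. If all you need is the list of links, parsing the downloaded HTML directly is probably simpler.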

answered Mar 07 '23 02:03 by chris.w.mclean