I need a Powerful Web Scraper library [closed]

I need a powerful web scraper library for mining content from the web. It can be paid or free; either is fine for me. Please suggest a library, or a better way to mine the data and store it in my preferred database. I have searched but didn't find a good solution. I would appreciate a recommendation from someone with experience.

asked Dec 07 '10 13:12

Pankaj Mishra

2 Answers

Scraping is easy, really: you just have to parse the content you download and collect all the links it contains.
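The "download, parse, collect the links" cycle can be sketched as a small breadth-first loop. This is only an illustration under assumptions: the CrawlSketch class, the Crawl method, and the fetchLinks callback (which stands in for downloading a page and returning its href values) are hypothetical names, not part of any library.

```csharp
using System;
using System.Collections.Generic;

public static class CrawlSketch
{
    // Breadth-first crawl starting at 'start'. 'fetchLinks' is a hypothetical
    // callback that downloads one page and returns the raw href values on it.
    // Returns the pages in the order they were visited.
    public static List<Uri> Crawl(Uri start, Func<Uri, IEnumerable<string>> fetchLinks, int maxPages)
    {
        var visited = new HashSet<Uri> { start };
        var order = new List<Uri>();
        var queue = new Queue<Uri>();
        queue.Enqueue(start);

        while (queue.Count > 0 && order.Count < maxPages)
        {
            Uri page = queue.Dequeue();
            order.Add(page);

            foreach (string href in fetchLinks(page))
            {
                // Resolve relative links against the page they were found on
                Uri link;
                if (!Uri.TryCreate(page, href, out link))
                    continue;

                // Skip mailto:, javascript: and other non-web schemes
                if (link.Scheme != Uri.UriSchemeHttp && link.Scheme != Uri.UriSchemeHttps)
                    continue;

                // HashSet.Add returns false for pages that are already queued or visited
                if (visited.Add(link))
                    queue.Enqueue(link);
            }
        }
        return order;
    }
}
```

Passing the link extraction in as a callback keeps the loop independent of whichever downloader and HTML parser you choose.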

The most important piece, though, is the part that processes the HTML. Because browsers will render HTML that is neither clean nor standards-compliant, you need an HTML parser that can make sense of markup that is not always well-formed.

I recommend you use the HTML Agility Pack for this purpose. It handles malformed HTML very well and provides an easy interface for running XPath queries to get nodes from the resulting document.
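As a rough illustration of that XPath interface, the sketch below pulls every href out of a string of HTML. It assumes the HtmlAgilityPack NuGet package is referenced; LinkExtractor and ExtractLinks are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class LinkExtractor
{
    // Returns the href value of every <a> element that has one
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of unclosed tags and unquoted attributes

        var links = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return links;

        foreach (HtmlNode anchor in anchors)
            links.Add(anchor.GetAttributeValue("href", String.Empty));

        return links;
    }
}
```

The null check matters in practice: pages with no matching nodes are common when crawling, and iterating the result directly would throw.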

Beyond that, you just need two things: a data store to hold your processed data (any database technology will do) and a way to download content from the web. For the latter, .NET provides two high-level mechanisms: the WebClient class and the HttpWebRequest/HttpWebResponse classes.
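For the download step, a minimal sketch with WebClient might look like the following. PageDownloader and Download are hypothetical names, and the UTF-8 encoding is an assumption; a real crawler would want to honour the charset the server reports.

```csharp
using System;
using System.Net;
using System.Text;

public static class PageDownloader
{
    // Downloads the resource at 'address' and returns it as a string
    public static string Download(Uri address)
    {
        using (var client = new WebClient())
        {
            // Assumption: the page is UTF-8 encoded
            client.Encoding = Encoding.UTF8;
            return client.DownloadString(address);
        }
    }
}
```

The returned string can then be handed to whichever HTML parser you settle on, and the extracted values written to your database.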

answered Sep 23 '22 17:09

casperOne

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Note: TextParser, the base class of HtmlParser, is not included in this answer.
namespace SoftCircuits.Parsing
{
    public class HtmlTag
    {
        /// <summary>
        /// Name of this tag
        /// </summary>
        public string Name { get; set; }

        /// <summary>
        /// Collection of attribute names and values for this tag
        /// </summary>
        public Dictionary<string, string> Attributes { get; set; }

        /// <summary>
        /// True if this tag contained a trailing forward slash
        /// </summary>
        public bool TrailingSlash { get; set; }

        /// <summary>
        /// Indicates if this tag contains the specified attribute. Note that
        /// true is returned when this tag contains the attribute even when the
        /// attribute has no value
        /// </summary>
        /// <param name="name">Name of attribute to check</param>
        /// <returns>True if tag contains attribute or false otherwise</returns>
        public bool HasAttribute(string name)
        {
            return Attributes.ContainsKey(name);
        }
    }

    public class HtmlParser : TextParser
    {
        public HtmlParser()
        {
        }

        public HtmlParser(string html) : base(html)
        {
        }

        /// <summary>
        /// Parses the next tag that matches the specified tag name
        /// </summary>
        /// <param name="name">Name of the tags to parse ("*" = parse all tags)</param>
        /// <param name="tag">Returns information on the next occurrence of the specified tag or null if none found</param>
        /// <returns>True if a tag was parsed or false if the end of the document was reached</returns>
        public bool ParseNext(string name, out HtmlTag tag)
        {
            // Must always set out parameter
            tag = null;

            // Nothing to do if no tag specified
            if (String.IsNullOrEmpty(name))
                return false;

            // Loop until match is found or no more tags
            MoveTo('<');
            while (!EndOfText)
            {
                // Skip over opening '<'
                MoveAhead();

                // Examine first tag character
                char c = Peek();
                if (c == '!' && Peek(1) == '-' && Peek(2) == '-')
                {
                    // Skip over comments
                    const string endComment = "-->";
                    MoveTo(endComment);
                    MoveAhead(endComment.Length);
                }
                else if (c == '/')
                {
                    // Skip over closing tags
                    MoveTo('>');
                    MoveAhead();
                }
                else
                {
                    bool result, inScript;

                    // Parse tag
                    result = ParseTag(name, ref tag, out inScript);
                    // Because scripts may contain tag characters, we have special
                    // handling to skip over script contents
                    if (inScript)
                        MovePastScript();
                    // Return true if requested tag was found
                    if (result)
                        return true;
                }
                // Find next tag
                MoveTo('<');
            }
            // No more matching tags found
            return false;
        }

        /// <summary>
        /// Parses the contents of an HTML tag. The current position should be at the first
        /// character following the tag's opening less-than character.
        ///
        /// Note: We parse to the end of the tag even if this tag was not requested by the
        /// caller. This ensures subsequent parsing takes place after this tag
        /// </summary>
        /// <param name="reqName">Name of the tag the caller is requesting, or "*" if caller
        /// is requesting all tags</param>
        /// <param name="tag">Returns information on this tag if it's one the caller is
        /// requesting</param>
        /// <param name="inScript">Returns true if tag began, and did not end, a script
        /// block</param>
        /// <returns>True if data is being returned for a tag requested by the caller
        /// or false otherwise</returns>
        protected bool ParseTag(string reqName, ref HtmlTag tag, out bool inScript)
        {
            bool doctype, requested;
            doctype = inScript = requested = false;

            // Get name of this tag
            string name = ParseTagName();

            // Special handling
            if (String.Compare(name, "!DOCTYPE", true) == 0)
                doctype = true;
            else if (String.Compare(name, "script", true) == 0)
                inScript = true;

            // Is this a tag requested by caller?
            if (reqName == "*" || String.Compare(name, reqName, true) == 0)
            {
                // Yes
                requested = true;
                // Create new tag object
                tag = new HtmlTag();
                tag.Name = name;
                tag.Attributes = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
            }

            // Parse attributes
            MovePastWhitespace();
            while (Peek() != '>' && Peek() != NullChar)
            {
                if (Peek() == '/')
                {
                    // Handle trailing forward slash
                    if (requested)
                        tag.TrailingSlash = true;
                    MoveAhead();
                    MovePastWhitespace();
                    // If this is a script tag, it was closed
                    inScript = false;
                }
                else
                {
                    // Parse attribute name
                    name = (!doctype) ? ParseAttributeName() : ParseAttributeValue();
                    MovePastWhitespace();
                    // Parse attribute value
                    string value = String.Empty;
                    if (Peek() == '=')
                    {
                        MoveAhead();
                        MovePastWhitespace();
                        value = ParseAttributeValue();
                        MovePastWhitespace();
                    }
                    // Add attribute to collection if requested tag
                    if (requested)
                    {
                        // This attribute replaces any existing attribute with the same name
                        if (tag.Attributes.ContainsKey(name))
                            tag.Attributes.Remove(name);
                        tag.Attributes.Add(name, value);
                    }
                }
            }
            // Skip over closing '>'
            MoveAhead();

            return requested;
        }

        /// <summary>
        /// Parses a tag name. The current position should be the first character of the name
        /// </summary>
        /// <returns>Returns the parsed name string</returns>
        protected string ParseTagName()
        {
            int start = Position;
            while (!EndOfText && !Char.IsWhiteSpace(Peek()) && Peek() != '>')
                MoveAhead();
            return Substring(start, Position);
        }

        /// <summary>
        /// Parses an attribute name. The current position should be the first character
        /// of the name
        /// </summary>
        /// <returns>Returns the parsed name string</returns>
        protected string ParseAttributeName()
        {
            int start = Position;
            while (!EndOfText && !Char.IsWhiteSpace(Peek()) && Peek() != '>' && Peek() != '=')
                MoveAhead();
            return Substring(start, Position);
        }

        /// <summary>
        /// Parses an attribute value. The current position should be the first non-whitespace
        /// character following the equal sign.
        ///
        /// Note: We terminate the name or value if we encounter a new line. This seems to
        /// be the best way of handling errors such as values missing closing quotes, etc.
        /// </summary>
        /// <returns>Returns the parsed value string</returns>
        protected string ParseAttributeValue()
        {
            int start, end;
            char c = Peek();
            if (c == '"' || c == '\'')
            {
                // Move past opening quote
                MoveAhead();
                // Parse quoted value
                start = Position;
                MoveTo(new char[] { c, '\r', '\n' });
                end = Position;
                // Move past closing quote
                if (Peek() == c)
                    MoveAhead();
            }
            else
            {
                // Parse unquoted value
                start = Position;
                while (!EndOfText && !Char.IsWhiteSpace(c) && c != '>')
                {
                    MoveAhead();
                    c = Peek();
                }
                end = Position;
            }
            return Substring(start, end);
        }

        /// <summary>
        /// Locates the end of the current script and moves past the closing tag
        /// </summary>
        protected void MovePastScript()
        {
            const string endScript = "</script";

            while (!EndOfText)
            {
                MoveTo(endScript, true);
                MoveAhead(endScript.Length);
                if (Peek() == '>' || Char.IsWhiteSpace(Peek()))
                {
                    MoveTo('>');
                    MoveAhead();
                    break;
                }
            }
        }
    }
}
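HtmlParser derives from a TextParser class that the answer does not show, so the code does not compile on its own. The following is a hypothetical stand-in, reconstructed only from the members HtmlParser calls (Peek, MoveAhead, MoveTo, MovePastWhitespace, Substring, Position, EndOfText, NullChar). It is not the original SoftCircuits class, and the real one may behave differently at the edges.

```csharp
using System;

namespace SoftCircuits.Parsing
{
    // Hypothetical minimal base class; signatures are inferred from how HtmlParser uses them
    public class TextParser
    {
        public const char NullChar = '\0';

        private readonly string _text = String.Empty;

        // Index of the current character within the text
        public int Position { get; private set; }

        // True once the current position has reached the end of the text
        public bool EndOfText { get { return Position >= _text.Length; } }

        public TextParser() { }

        public TextParser(string text) { _text = text ?? String.Empty; }

        // Returns the character 'ahead' positions past the current one,
        // or NullChar if that position is outside the text
        public char Peek(int ahead = 0)
        {
            int pos = Position + ahead;
            return (pos >= 0 && pos < _text.Length) ? _text[pos] : NullChar;
        }

        // Advances the position, never moving past the end of the text
        public void MoveAhead(int count = 1)
        {
            Position = Math.Min(Position + count, _text.Length);
        }

        // Each MoveTo overload moves to the next match, or to the end of the text if none
        public void MoveTo(char c)
        {
            int pos = _text.IndexOf(c, Position);
            Position = (pos < 0) ? _text.Length : pos;
        }

        public void MoveTo(char[] chars)
        {
            int pos = _text.IndexOfAny(chars, Position);
            Position = (pos < 0) ? _text.Length : pos;
        }

        public void MoveTo(string s, bool ignoreCase = false)
        {
            StringComparison comparison = ignoreCase
                ? StringComparison.OrdinalIgnoreCase
                : StringComparison.Ordinal;
            int pos = _text.IndexOf(s, Position, comparison);
            Position = (pos < 0) ? _text.Length : pos;
        }

        public void MovePastWhitespace()
        {
            while (Char.IsWhiteSpace(Peek()))
                MoveAhead();
        }

        // Note: 'end' is an exclusive end index, not a length
        public string Substring(int start, int end)
        {
            return _text.Substring(start, end - start);
        }
    }
}

// Usage with the HtmlParser above (hypothetical input):
//   var parser = new HtmlParser("<a href='page.html'>link</a>");
//   HtmlTag tag;
//   while (parser.ParseNext("a", out tag))
//       if (tag.HasAttribute("href"))
//           Console.WriteLine(tag.Attributes["href"]);
```

Clamping every move to the end of the text is a deliberate choice here: it lets the loops in HtmlParser terminate on truncated or malformed input by checking EndOfText, instead of throwing.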

answered Sep 25 '22 17:09

Kutta

Tags: c#, .net, web-scraping, web-crawler
