
How do I scrape only the <body> tag off of a website

I'm working on a web crawler. At the moment I scrape the whole page and then use regular expressions to remove <meta>, <script>, <style> and other tags, leaving only the content of the body.

However, I'm trying to optimise the performance and I was wondering if there's a way I could scrape only the <body> of the page?

using System.IO;
using System.Net;
using System.Text.RegularExpressions;

namespace WebScraper
{
    public static class KrioScraper
    {    
        public static string scrapeIt(string siteToScrape)
        {
            string HTML = getHTML(siteToScrape);
            string text = stripCode(HTML);
            return text;
        }

        public static string getHTML(string siteToScrape)
        {
            string response = "";
            HttpWebResponse objResponse;
            HttpWebRequest objRequest = 
                (HttpWebRequest) WebRequest.Create(siteToScrape);
            objRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; " +
                "Windows NT 5.1; .NET CLR 1.0.3705)";
            // Dispose both the response and the reader when done
            using (objResponse = (HttpWebResponse) objRequest.GetResponse())
            using (StreamReader sr =
                new StreamReader(objResponse.GetResponseStream()))
            {
                response = sr.ReadToEnd();
            }
            return response;
        }

        public static string stripCode(string the_html)
        {
            // Remove google analytics code and other JS
            the_html = Regex.Replace(the_html, "<script.*?</script>", "", 
                RegexOptions.Singleline | RegexOptions.IgnoreCase);
            // Remove inline stylesheets
            the_html = Regex.Replace(the_html, "<style.*?</style>", "", 
                RegexOptions.Singleline | RegexOptions.IgnoreCase);
            // Remove HTML tags (case-insensitively, so <DIV> is stripped too)
            the_html = Regex.Replace(the_html, "</?[a-z][a-z0-9]*[^<>]*>", "",
                RegexOptions.IgnoreCase);
            // Remove HTML comments
            the_html = Regex.Replace(the_html, "<!--.*?-->", "",
                RegexOptions.Singleline);
            // Remove Doctype
            the_html = Regex.Replace(the_html, "<!.*?>", "",
                RegexOptions.Singleline);
            // Remove excessive whitespace
            the_html = Regex.Replace(the_html, "[\t\r\n]", " ");

            return the_html;
        }
    }
}

From Page_Load I call the scrapeIt() method, passing it the string entered in a textbox on the page.

Johancho asked Aug 16 '11 17:08


2 Answers

Plain string searching is still the simplest and fastest method, though also the least accurate:

int start = response.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
int end = response.LastIndexOf("</body>", StringComparison.OrdinalIgnoreCase);
// Fall back to the whole document when either tag is missing
if (start < 0 || end < start) return response;
return response.Substring(start, end - start + "</body>".Length);

Obviously, if there's JavaScript in the <head> tag like...

document.write("<body>");

...then you'll end up with a little more than you wanted.
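
To make that pitfall concrete, here is a minimal sketch that wraps the string-search approach in a helper and runs it on two inputs. The class name `BodySlicer`, the method name `SliceBody`, the ordinal comparison, and the guard for missing tags are my own assumptions for illustration, not part of the snippet above.

```csharp
using System;

public static class BodySlicer
{
    // Hypothetical helper: returns the span from the first "<body"
    // to the last "</body>", or the input unchanged if either is missing.
    public static string SliceBody(string html)
    {
        int start = html.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
        int end = html.LastIndexOf("</body>", StringComparison.OrdinalIgnoreCase);
        if (start < 0 || end < start) return html; // no usable body tags found
        return html.Substring(start, end - start + "</body>".Length);
    }

    public static void Main()
    {
        string clean = "<html><head><title>t</title></head><body><p>Hi</p></body></html>";
        Console.WriteLine(SliceBody(clean));
        // prints: <body><p>Hi</p></body>

        string tricky = "<html><head><script>document.write(\"<body>\");</script></head>"
                      + "<body><p>Hi</p></body></html>";
        Console.WriteLine(SliceBody(tricky));
        // The result starts at the "<body" inside the script,
        // so the tail of the <head> is included as well.
    }
}
```

On the second input the slice begins inside the script string, which is exactly the "little more than you wanted" case. If that matters for your pages, a real parser is the safer choice.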

Louis Ricci answered Oct 27 '22 00:10


I'd suggest taking advantage of the HTML Agility Pack to do the HTML parsing/manipulation.

You can easily select the body like this:

var webGet = new HtmlWeb();
var document = webGet.Load(url);
var body = document.DocumentNode.SelectSingleNode("//body");
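
If what you ultimately want is the text of the body rather than the node, something along these lines might work. It is only a sketch: it assumes the HtmlAgilityPack NuGet package is referenced, and the class name `BodyText` and method name `FromHtml` are made up for illustration. Note that the whole page is still downloaded either way; the parser only selects the body afterwards.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

public static class BodyText
{
    // Hypothetical helper: returns the text inside <body>,
    // or an empty string when the document has no body element.
    public static string FromHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
        if (body == null) return string.Empty;

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var unwanted = body.SelectNodes(".//script|.//style");
        if (unwanted != null)
        {
            // Copy to a list first so removal cannot disturb the enumeration.
            foreach (HtmlNode node in unwanted.ToList())
                node.Remove();
        }

        // InnerText keeps entities such as &amp; encoded, so decode them.
        return HtmlEntity.DeEntitize(body.InnerText).Trim();
    }
}
```

You could call it as `BodyText.FromHtml(KrioScraper.getHTML(url))` to replace the regex-based stripCode() step, keeping the existing download code as it is.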

Joel Beckham answered Oct 27 '22 00:10