How to design a crawl bot?

I'm working on a little project to analyze the content on some sites I find interesting; this is a real DIY project that I'm doing for my entertainment/enlightenment, so I'd like to code as much of it on my own as possible.

Obviously, I'm going to need data to feed my application, and I was thinking I would write a little crawler that would take maybe 20k pages of HTML and write them to text files on my hard drive. However, when I took a look on SO and other sites, I couldn't find any information on how to do this. Is it feasible? It seems like there are open-source options available (WebSPHINX?), but I would like to write this myself if possible.

Scheme is the only language I know well, but I thought I'd use this project to teach myself some Java, so I'd be interested if there are any Racket or Java libraries that would be helpful for this.

So I guess to summarize my question, what are some good resources to get started on this? How can I get my crawler to request info from other servers? Will I have to write a simple parser for this, or is that unnecessary given I want to take the whole html file and save it as txt?

Asked Jan 20 '12 by Pseudo-Gorgias

2 Answers

This is entirely feasible, and you can definitely do it with Racket. You may want to take a look at the PLaneT libraries; in particular, Neil Van Dyke's HtmlPrag:

http://planet.racket-lang.org/display.ss?package=htmlprag.plt&owner=neil

... is probably the place to start. You should be able to pull the content of a web page into a parsed format in one or two lines of code.
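
For instance, a minimal sketch of that (I'm recalling the PLaneT require spec and the html->sxml name from HtmlPrag's documentation, so treat both as assumptions to verify):

(require net/url
         (planet neil/htmlprag:1))  ; adjust the version spec if PLaneT asks for one

(define (page->sxml url-string)
  ;; get-pure-port returns an input port for the response body,
  ;; which HtmlPrag's parser reads directly into an SXML tree.
  (html->sxml (get-pure-port (string->url url-string))))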

Let me know if you have any questions about this.

Answered by John Clements


Having done this myself in Racket, here is what I would suggest.

Start with a "Unix tools" approach:

  • Use curl to do the work of downloading each page (you can execute it from Racket using system) and storing the output in a temporary file.
  • Use Racket to extract the URIs from the <a> tags.
    • You can "cheat" and do a regular expression string search (see the sketch after this list).
    • Or, do it "the right way" with a true HTML parser, as John Clements' great answer explains.
    • Consider maybe doing the cheat first, then looping back later to do it the right way.
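
For the regexp "cheat", a rough sketch (the extract-hrefs name is mine; it is case-sensitive and only handles double-quoted href attributes, which is exactly the kind of corner a real parser covers):

;; Quick and dirty: pull href values out of raw HTML with a regexp.
;; #:match-select cadr keeps just the first capture group (the URI itself).
(define (extract-hrefs html-string)
  (regexp-match* #px"<a\\s[^>]*href=\"([^\"]*)\"" html-string
                 #:match-select cadr))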

At this point you could stop, or, you could go back and replace curl with your own code to do the downloads. For this you can use Racket's net/url module.
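
A bare-bones sketch of that replacement (the get/save name is mine, and #:redirections needs a reasonably recent Racket):

(require net/url
         racket/port)

;; Fetch url-string and write the response body to out-file,
;; following up to five 30x redirects.
(define (get/save url-string out-file)
  (call-with-output-file out-file #:exists 'replace
    (lambda (out)
      (copy-port (get-pure-port (string->url url-string) #:redirections 5)
                 out))))

Note that this still handles none of the cookie or keep-alive issues listed below.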

The reason I suggest trying curl first is that it does things for you that are more complicated than they might seem:

  • Do you want to follow 30x redirects?
  • Do you want to accept/store/provide cookies (the site may behave differently otherwise)?
  • Do you want to use HTTP keep-alive?
  • And on and on.

For example, using curl like this:

(define curl-core-options
  (string-append
   "--silent "
   "--show-error "
   "--location "
   "--connect-timeout 10 "
   "--max-time 30 "
   "--cookie-jar " (path->string (build-path 'same "tmp" "cookies")) " "
   "--keepalive-time 60 "
   "--user-agent 'my crawler' "
   "--globoff " ))

(define (curl/head url out-file)
  (system (format "curl ~a --head --output ~a --url \"~a\""
                  curl-core-options
                  (path->string out-file)
                  url)))

(define (curl/get url out-file)
  (system (format "curl ~a --output ~a --url \"~a\""
                  curl-core-options
                  (path->string out-file)
                  url)))

represents a lot of code that you would otherwise need to write from scratch in Racket, to do all the things those curl command-line flags are doing for you.
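
For example, calling it from Racket (this assumes a tmp directory already exists next to the program, since the cookie-jar option above points there):

;; system returns #t when curl exits with status 0.
(curl/get "http://example.com/" (build-path 'same "tmp" "page-0001.html"))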

In short: Start with the simplest case of using existing tools. Use Racket almost as a shell script. If that's good enough for you, stop. Otherwise go on to replace the tools one by one with your bespoke code.

Answered by Greg Hendershott