I'm working on a little project to analyze the content on some sites I find interesting; this is a real DIY project that I'm doing for my entertainment/enlightenment, so I'd like to code as much of it on my own as possible.
Obviously, I'm going to need data to feed my application, and I was thinking I would write a little crawler that would take maybe 20k pages of HTML and write them to text files on my hard drive. However, when I took a look on SO and other sites, I couldn't find any information on how to do this. Is it feasible? It seems like there are open-source options available (WebSPHINX?), but I would like to write this myself if possible.
Scheme is the only language I know well, but I thought I'd use this project to teach myself some Java, so I'd be interested if there are any Racket or Java libraries that would be helpful for this.
So I guess to summarize my question: what are some good resources to get started on this? How can I get my crawler to request info from other servers? Will I have to write a simple parser for this, or is that unnecessary given that I want to take the whole HTML file and save it as .txt?
This is entirely feasible, and you can definitely do it with Racket. You may want to take a look at the PLaneT libraries; in particular, Neil Van Dyke's HtmlPrag:
http://planet.racket-lang.org/display.ss?package=htmlprag.plt&owner=neil
... is probably the place to start. You should be able to pull the content of a web page into a parsed format in one or two lines of code.
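To give a rough idea, here is a minimal sketch (the version number in the PLaneT require line is illustrative; check the package page above for the current one):

(require net/url
         racket/port
         (planet neil/htmlprag:1))  ; HtmlPrag; version is illustrative

;; Fetch a page and parse its HTML into an SHTML (SXML-style) tree.
(define (fetch-and-parse url-string)
  (html->shtml (port->string (get-pure-port (string->url url-string)))))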
Let me know if you have any questions about this.
Having done this myself in Racket, here is what I would suggest.
Start with a "Unix tools" approach:

- Use curl to do the work of downloading each page (you can execute it from Racket using system) and storing the output in a temporary file.
- Use Racket to extract the URLs to crawl next from the <a> tags in the downloaded HTML (a rough sketch of this step follows below).
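For that extraction step, here is a minimal sketch, assuming the page has already been saved to a file. A simple regexp is enough to get started; an HTML parser such as HtmlPrag is more robust:

(require racket/file)

;; Pull the href values out of saved HTML with a regexp.
(define (extract-links html-file)
  (regexp-match* #rx"href=\"([^\"]*)\"" (file->string html-file)
                 #:match-select cadr))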
At this point you could stop, or you could go back and replace curl with your own code to do the downloads. For this you can use Racket's net/url module (a minimal sketch follows below).
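As a sketch of that replacement step, using net/url to download a page straight to a file (the function name and redirect count are just examples):

(require net/url racket/port)

;; Download url-string and write the response body to out-path.
(define (download! url-string out-path)
  (call-with-output-file out-path
    (lambda (out)
      (define in (get-pure-port (string->url url-string) #:redirections 5))
      (copy-port in out)
      (close-input-port in))
    #:exists 'replace))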
Why I suggest trying curl first is that it helps you do things that are more complicated than they might seem: following redirects, handling cookies, setting connection timeouts, supplying a user agent, using keep-alive, and so on.

Using curl, for example, like this:
;; Options shared by every curl invocation: follow redirects, set
;; timeouts, store cookies, keep connections alive, identify ourselves.
(define curl-core-options
  (string-append
   "--silent "
   "--show-error "
   "--location "
   "--connect-timeout 10 "
   "--max-time 30 "
   "--cookie-jar " (path->string (build-path 'same "tmp" "cookies")) " "
   "--keepalive-time 60 "
   "--user-agent 'my crawler' "
   "--globoff "))

;; Fetch only the HTTP headers for url into out-file.
(define (curl/head url out-file)
  (system (format "curl ~a --head --output ~a --url \"~a\""
                  curl-core-options
                  (path->string out-file)
                  url)))

;; Fetch the body of url into out-file.
(define (curl/get url out-file)
  (system (format "curl ~a --output ~a --url \"~a\""
                  curl-core-options
                  (path->string out-file)
                  url)))
is a lot of code that you would otherwise need to write from scratch in Racket to do all the things those curl command-line flags are doing for you.
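Fetching one page then looks like this (the output path is just an example):

(curl/get "http://example.com/" (build-path 'same "tmp" "page0001.html"))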
In short: Start with the simplest case of using existing tools. Use Racket almost as a shell script. If that's good enough for you, stop. Otherwise go on to replace the tools one by one with your bespoke code.