I created a little web spider in Python which I'm using to collect URLs. I'm not interested in the content. Right now I'm keeping all the visited URLs in a set in memory, because I don't want my spider to visit URLs twice. Of course that's a very limited way of accomplishing this.
So what's the best way to keep track of my visited URLs?
Should I use a database?
Or should I write them to a file?
I'm sure there are books and a lot of papers on this or similar topics. Can you give me some advice on what I should read?
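For reference, the in-memory approach described in the question looks roughly like this minimal sketch; `fetch_links()` is a hypothetical stand-in for whatever fetching and parsing the spider already does:

```python
from collections import deque

def crawl(seed_urls, fetch_links):
    visited = set()              # every URL we have already requested
    queue = deque(seed_urls)     # URLs still to visit
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url):   # assumed to return URLs found on the page
            if link not in visited:
                queue.append(link)
    return visited
```

This works fine until the set outgrows RAM or the process dies, which is exactly the problem the answers below address.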
I've written a lot of spiders. To me, a bigger problem than running out of memory is the potential of losing all the URLs you've spidered if the code or machine crashes, or if you decide you need to tweak the code. If you run out of RAM, most machines and OSes these days will page, so you'll slow down but still function. Having to rebuild a set of URLs gathered over hours and hours of run-time because it's no longer available can be a real blow to productivity.
Keeping information in RAM that you do NOT want to lose is bad. Obviously a database is the way to go at that point, because you need fast random access to see if you've already found a URL. Of course in-memory lookups are faster, but the trade-off of figuring out WHICH URLs to keep in memory adds overhead. Rather than trying to write code to determine which URLs I do and don't need, I keep them all in the database and concentrate on making my code clean and maintainable and my SQL queries and schemas sensible. Make your URL field a unique index and the database will be able to find URLs in no time while automatically rejecting redundant links.
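A rough sketch of that unique-index idea using SQLite from the standard library; the database filename, table name, and helper function are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect("spider.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS visited (
        url TEXT PRIMARY KEY        -- PRIMARY KEY gives us the unique index
    )
""")

def mark_visited(url):
    """Record the URL; return True if it was new, False if already seen."""
    cur = conn.execute("INSERT OR IGNORE INTO visited (url) VALUES (?)", (url,))
    conn.commit()
    return cur.rowcount == 1        # 0 rows inserted means the URL was a duplicate
```

The database enforces uniqueness, so the spider never has to deduplicate in application code, and the set of visited URLs survives crashes and restarts.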
Your connection to the internet and to the sites you're accessing will probably be a lot slower than your connection to a database on a machine on your internal network. An SQLite database on the same machine might be the fastest, though the engine itself isn't as sophisticated as Postgres, which is my favorite. I found putting the database on another machine on the same switch as my spidering machine to be extremely fast. Making one machine handle the spidering, parsing, and database reads/writes is pretty intensive, so if you have an old box, throw Linux on it, install Postgres, and go to town. Throw some extra RAM in the box if you need more speed. Having that separate box for database use can be very nice.
These seem to be the important aspects to me: don't keep the only copy of your URLs in RAM, let the database enforce uniqueness with an index on the URL field, and move the database to a separate box if the spidering machine is already busy.
There are many ways to do this and it depends on how big your database will get. I think an SQL database can provide a good model for your problem.
Probably all you need is an SQLite database. String lookups for existence checks are typically slow. To speed this up, you can compute a CRC checksum of each URL and store both the CRC and the URL in your database, with an index on the CRC field.
There is of course a chance of collisions between URL checksums, but if 100% coverage isn't critical, you can accept the hit of occasionally skipping a URL whose CRC collides with one already in your DB.
You can also reduce collisions in several ways. For example, you can increase the size of your CRC (CRC8 instead of CRC4) or use a hashing algorithm with a larger output, or store the URL length alongside the CRC.
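A sketch of this approach with `zlib.crc32` and SQLite; the table, index, and function names are assumptions for illustration. Matching on the CRC alone accepts the occasional miss described above; comparing the stored URL as well, as done here, removes that risk at the cost of a string comparison on the few rows that share a CRC:

```python
import sqlite3
import zlib

conn = sqlite3.connect("spider.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS visited (
        crc INTEGER NOT NULL,   -- 32-bit checksum of the URL
        url TEXT NOT NULL
    )
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_visited_crc ON visited (crc)")

def already_seen(url):
    crc = zlib.crc32(url.encode("utf-8"))
    # The index makes this lookup cheap; checking the URL too avoids
    # false positives from CRC collisions.
    row = conn.execute(
        "SELECT 1 FROM visited WHERE crc = ? AND url = ? LIMIT 1", (crc, url)
    ).fetchone()
    return row is not None

def record(url):
    crc = zlib.crc32(url.encode("utf-8"))
    conn.execute("INSERT INTO visited (crc, url) VALUES (?, ?)", (crc, url))
    conn.commit()
```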