 

Storing URLs while Spidering

I created a little web spider in Python which I'm using to collect URLs. I'm not interested in the content. Right now I'm keeping all the visited URLs in a set in memory, because I don't want my spider to visit URLs twice. Of course that's a very limited way of accomplishing this.

So what's the best way to keep track of my visited URLs?

Should I use a database?

  • which one? MySQL, SQLite, PostgreSQL?
  • how should I save the URLs? As a primary key, trying to insert every URL before visiting it?

Or should I write them to a file?

  • one file?
  • multiple files? how should I design the file-structure?

I'm sure there are books and a lot of papers on this or similar topics. Can you give me some advice on what I should read?

asked Apr 11 '10 by user313743


2 Answers

I've written a lot of spiders. To me, a bigger problem than running out of memory is the potential of losing all the URLs you've already spidered if the code or machine crashes, or if you decide you need to tweak the code. If you run out of RAM, most machines and OSes these days will page, so you'll slow down but still function. Having to rebuild a set of URLs gathered over hours and hours of run-time because it's no longer available can be a real blow to productivity.

Keeping information in RAM that you do NOT want to lose is bad. Obviously a database is the way to go at that point, because you need fast random access to see if you've already found a URL. Of course in-memory lookups are faster, but the trade-off of figuring out WHICH URLs to keep in memory adds overhead. Rather than writing code to decide which URLs I do and don't need, I keep them all in the database and concentrate on making my code clean and maintainable and my SQL queries and schemas sensible. Make your URL field a unique index and the DBM will be able to find them in no time while automatically avoiding redundant links (a sketch of this follows below).
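
A rough sketch of that idea, assuming the asker's Python spider and an SQLite database (the table and function names here are illustrative, not from the answer): a primary-key constraint on the URL column gives the unique index, so the database itself rejects duplicates.

    import sqlite3

    conn = sqlite3.connect("spider.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS visited_urls (
            url TEXT PRIMARY KEY  -- PRIMARY KEY implies a unique index on the URL
        )
    """)

    def mark_visited(url):
        """Record the URL; return True if it was new, False if already seen."""
        cur = conn.execute("INSERT OR IGNORE INTO visited_urls (url) VALUES (?)", (url,))
        conn.commit()
        return cur.rowcount == 1  # 0 rows inserted means the URL was a duplicate

    if __name__ == "__main__":
        print(mark_visited("http://example.com/"))  # True the first time
        print(mark_visited("http://example.com/"))  # False on the repeat

A single INSERT OR IGNORE both records the URL and tells you whether it was already there, so the spider never needs a separate existence check.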

Your connection to the internet and to the sites you're accessing will probably be a lot slower than your connection to a database on a machine on your internal network. A SQLite database on the same machine might be the fastest, though the DBM itself isn't as sophisticated as Postgres, which is my favorite. I found putting the database on another machine on the same switch as my spidering machine to be extremely fast. Making one machine handle the spidering, parsing, and then the database reads/writes is pretty intensive, so if you have an old box, throw Linux on it, install Postgres, and go to town. Throw some extra RAM in the box if you need more speed. Having that separate box for database use can be very nice.
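
If the database does live on a separate box as suggested, only the connection details change from the spider's point of view. A hypothetical sketch using psycopg2 against a remote PostgreSQL server (host, credentials, and table name are placeholders, not values from the answer):

    import psycopg2

    # Placeholder connection details for a PostgreSQL box on the same switch.
    conn = psycopg2.connect(host="db-box.local", dbname="spider",
                            user="spider", password="secret")

    with conn, conn.cursor() as cur:
        cur.execute("CREATE TABLE IF NOT EXISTS visited_urls (url TEXT PRIMARY KEY)")

    def mark_visited(url):
        """Record the URL; ON CONFLICT makes a duplicate insert a no-op."""
        with conn, conn.cursor() as cur:
            cur.execute(
                "INSERT INTO visited_urls (url) VALUES (%s) ON CONFLICT (url) DO NOTHING",
                (url,))
            return cur.rowcount == 1  # 0 means the URL was already recorded

The spidering machine then spends its cycles on fetching and parsing while the database box handles the index lookups and writes.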

answered Oct 04 '22 by the Tin Man


These seem to be the important aspects to me:

  1. You can't keep all the URLs in memory, because RAM usage will grow too high
  2. You need fast existence lookups, at least O(log n)
  3. You need fast insertions

There are many ways to do this and it depends on how big your database will get. I think an SQL database can provide a good model for your problem.

Probably all you need is an SQLite database. Typically, string lookups for existence checks are slow. To speed this up, you can compute a CRC hash of the URL and store both the CRC and the URL in your database, with an index on the CRC field (a sketch of this scheme follows below).

  • When you insert: you insert the URL and its hash
  • When you want to do an existence lookup: you take the CRC of the potentially new URL and check whether it is already in your database

There is of course a chance of collision on the URL hashes, but if 100% coverage is not important to you, you can take the hit: when a new URL's CRC collides with one already stored, it will be treated as visited and skipped.

You can also decrease collisions in several ways: increase the size of the checksum (for example CRC-64 instead of CRC-32), use a hash algorithm with a larger output, or store the URL length alongside the CRC and check both.
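
A hedged sketch of that scheme, assuming Python's zlib.crc32 as the checksum and SQLite for storage (table and function names are again illustrative, not from the answer):

    import sqlite3
    import zlib

    conn = sqlite3.connect("spider.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS urls (
            crc INTEGER NOT NULL,  -- CRC-32 of the URL
            url TEXT NOT NULL      -- the URL itself, kept alongside the hash
        );
        CREATE INDEX IF NOT EXISTS idx_urls_crc ON urls (crc);
    """)

    def crc_of(url):
        return zlib.crc32(url.encode("utf-8"))

    def insert(url):
        conn.execute("INSERT INTO urls (crc, url) VALUES (?, ?)", (crc_of(url), url))
        conn.commit()

    def probably_seen(url):
        """True if any stored URL shares this CRC; collisions give false positives."""
        row = conn.execute("SELECT 1 FROM urls WHERE crc = ? LIMIT 1",
                           (crc_of(url),)).fetchone()
        return row is not None

Since the full URL is stored next to its CRC, the lookup could also compare the URL itself to rule out collisions entirely, at the cost of the string comparison this scheme is trying to avoid.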

answered Oct 03 '22 by Brian R. Bondy