Using SQL Server I want to store a list of URLs in a table. In addition I have the requirement that I do not want any URL to appear in the table more than once.
This would suggest making the URL the primary key for the table, but that is not possible in SQL Server because of the length of URLs. SQL Server limits the maximum length of an index key on a character field to 900 bytes, while URLs according to the spec are potentially unlimited; as a practical matter, IE supports URLs up to about 2K, so 900 is just too short.
My next thought is to use the HashBytes function to create a hash of the URL to use as an indexed column. In this case the potential exists that two different URLs might hash to the same value (unlikely, but possible), so I cannot use a unique index.
The bulk of the processing against this table will be inserts, which is what I wish to optimize for.
My thought is to have a URL column and a Hashvalue column and create a non-unique index on the Hashvalue.
Then I would create an INSERT trigger which would roll back the insert if the inserted Hashvalue matches an existing Hashvalue and the inserted URL matches an existing URL. My hope is that the query optimizer would use the index to first find the record(s) where the Hashvalues match, and then not have to do a full table scan to match the URL.
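A minimal sketch of that design in T-SQL (table, column, and trigger names here are illustrative, not from any existing schema) might look something like this:

```sql
-- Table with the raw URL plus a persisted computed hash to index.
CREATE TABLE dbo.Urls (
    Id      INT IDENTITY(1,1) PRIMARY KEY,
    Url     VARCHAR(2048) NOT NULL,
    UrlHash AS CAST(HASHBYTES('SHA2_256', Url) AS VARBINARY(32)) PERSISTED
);

-- Non-unique index: collisions are possible, so uniqueness is
-- enforced by the trigger below, not by the index itself.
CREATE INDEX IX_Urls_UrlHash ON dbo.Urls (UrlHash);
GO

CREATE TRIGGER trg_Urls_NoDupes ON dbo.Urls
AFTER INSERT
AS
BEGIN
    -- The optimizer can seek on UrlHash first, then compare the
    -- full URL string only for the (rare) hash-matching rows.
    IF EXISTS (
        SELECT 1
        FROM dbo.Urls u
        JOIN inserted i
          ON u.UrlHash = i.UrlHash
         AND u.Url     = i.Url
         AND u.Id     <> i.Id
    )
    BEGIN
        ROLLBACK TRANSACTION;
        RAISERROR ('Duplicate URL rejected.', 16, 1);
    END
END;
```

Note that the join predicate on both hash and full URL is what lets an index seek on the hash narrow the search before the expensive string comparison.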
Am I on the right track here or is there a better way to go about this?
You shouldn't have to do anything special to store a hyperlink within your database, as it is simply a string. So you'll want to use a VARCHAR or TEXT field, and you may want to consider making it fairly large (e.g. VARCHAR(512) or VARCHAR(MAX)), as URLs "can" be quite long, although you may not run into any that big.
There is a better way.
Create a new int field and set it to identity so it auto-increments. Generally speaking, using strings as index keys is pretty bad: for one thing, if you want to change a URL later down the line for whatever reason, you are going to have to update all the foreign keys, which becomes horrific pretty quickly. If you have a gabillion URLs as well, your database size will balloon; a simple int field keeps the size down.
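A sketch of that layout (names here are made up for illustration) would be a surrogate integer key with the URL stored as plain data, and other tables referencing the small int rather than the long string:

```sql
-- Surrogate integer key; the URL itself is just data.
CREATE TABLE dbo.Urls (
    UrlId INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Url   VARCHAR(2048) NOT NULL
);

-- Hypothetical referencing table: foreign keys point at the
-- 4-byte UrlId, not at a 2KB string, keeping indexes small.
CREATE TABLE dbo.Visits (
    VisitId   INT IDENTITY(1,1) PRIMARY KEY,
    UrlId     INT NOT NULL REFERENCES dbo.Urls (UrlId),
    VisitedAt DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
);
```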
I sometimes have thought that I can use other fields as primary keys, but elect for the int field and boy am I glad I did that further down the line.
Unless I misunderstand the problem: how often are you expecting to insert a URL? You could well be underestimating the capability of your database. Databases can perform a lot of queries very quickly, so do some tests! There should be no reason why you can't just check for the URL with a quick SELECT statement before inserting it.
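The check-before-insert idea is a few lines of T-SQL (`@Url` is a parameter supplied by the caller; the locking hints are one way to avoid a race between the check and the insert under concurrency, assuming the simple table shape above):

```sql
BEGIN TRANSACTION;

-- UPDLOCK/HOLDLOCK keep two concurrent sessions from both passing
-- the existence check and then both inserting the same URL.
IF NOT EXISTS (
    SELECT 1 FROM dbo.Urls WITH (UPDLOCK, HOLDLOCK)
    WHERE Url = @Url
)
    INSERT INTO dbo.Urls (Url) VALUES (@Url);

COMMIT;
```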
Or you could insert at will then at a later date do a batch job to remove duplicates.
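The batch de-duplication approach could be a periodic job along these lines, keeping the earliest row per URL (assuming an `Id` identity column as in the sketches above):

```sql
-- Rank duplicate URLs by insertion order and delete all but the first.
WITH Ranked AS (
    SELECT Id,
           ROW_NUMBER() OVER (PARTITION BY Url ORDER BY Id) AS rn
    FROM dbo.Urls
)
DELETE FROM Ranked
WHERE rn > 1;
```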
Or you could queue them for inserting.
I would keep it simple. I think you might be surprised at how fast a database can be for basic queries, they were designed with that in mind.
In my mind your biggest problem is going to be how to store URLs, since the same resource can be written in many ways. For example, instead of storing the whole URL in one column, why not normalise it further: store domain suffixes (.com, .co.uk etc.) separately, and have a table linking domains with suffixes/prefixes/protocols. Also remember that http://www.example.com can be different from http://example.com in some edge cases.
If you do normalise to a higher level, then your constraints and uniques are all going to get quite a lot more complicated to manage.
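One possible (purely illustrative) decomposition, which also shows where the complexity creeps in:

```sql
CREATE TABLE dbo.Protocols (
    ProtocolId TINYINT IDENTITY(1,1) PRIMARY KEY,
    Scheme     VARCHAR(16) NOT NULL UNIQUE     -- e.g. 'http', 'https'
);

CREATE TABLE dbo.Domains (
    DomainId INT IDENTITY(1,1) PRIMARY KEY,
    Host     VARCHAR(255) NOT NULL UNIQUE      -- e.g. 'example.com'
);

CREATE TABLE dbo.Urls (
    UrlId        INT IDENTITY(1,1) PRIMARY KEY,
    ProtocolId   TINYINT NOT NULL REFERENCES dbo.Protocols (ProtocolId),
    DomainId     INT NOT NULL REFERENCES dbo.Domains (DomainId),
    PathAndQuery VARCHAR(2048) NOT NULL
    -- Uniqueness now spans three columns, and the long PathAndQuery
    -- still runs into the 900-byte index-key limit, so the hash or
    -- trigger approach from the question re-emerges here too.
);
```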
Lots to think about! Make sure you design it well.