Using SQL Server I want to store a list of URLs in a table. In addition I have the requirement that I do not want any URL to appear in the table more than once.
This would suggest making the URL the primary key for the table, but that is not possible in SQL Server because of the length of URLs. SQL Server limits the maximum length of an index key on a character field to 900 bytes, while URLs according to the spec are potentially unlimited; as a practical matter, IE supports URLs up to about 2K, so 900 is just too short.
My next thought is to use the HashBytes function to create a hash of the URL to use as an indexed column. In this case the potential exists that two different URLs might hash to the same value (unlikely, but possible), so I cannot use a unique index.
The bulk of the processing against this table will be inserts, which is what I wish to optimize for.
My thought is to have a URL column and a Hashvalue column and create a non-unique index on the Hashvalue.
Then I would create an INSERT trigger which would roll back the insert if the inserted Hashvalue matches an existing Hashvalue and the inserted URL matches an existing URL. My hope is that the query optimizer would use the index to first find the record(s) where the Hashvalues match, and then not have to do a full table scan to match the URL.
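A minimal sketch of that design in T-SQL (table, column, and trigger names here are illustrative, not from any existing schema) might look something like this:

```sql
-- Table with the raw URL plus a persisted computed hash to index.
CREATE TABLE dbo.Urls (
    Id      INT IDENTITY(1,1) PRIMARY KEY,
    Url     VARCHAR(2048) NOT NULL,
    UrlHash AS CAST(HASHBYTES('SHA2_256', Url) AS VARBINARY(32)) PERSISTED
);

-- Non-unique index: collisions are possible, so uniqueness is
-- enforced by the trigger below, not by the index itself.
CREATE INDEX IX_Urls_UrlHash ON dbo.Urls (UrlHash);
GO

CREATE TRIGGER trg_Urls_NoDupes ON dbo.Urls
AFTER INSERT
AS
BEGIN
    -- The optimizer can seek on UrlHash first, then compare the
    -- full URL string only for the (rare) hash-matching rows.
    IF EXISTS (
        SELECT 1
        FROM dbo.Urls u
        JOIN inserted i
          ON u.UrlHash = i.UrlHash
         AND u.Url     = i.Url
         AND u.Id     <> i.Id
    )
    BEGIN
        ROLLBACK TRANSACTION;
        RAISERROR ('Duplicate URL rejected.', 16, 1);
    END
END;
```

Note that the join predicate on both hash and full URL is what lets an index seek on the hash narrow the search before the expensive string comparison.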
Am I on the right track here or is there a better way to go about this?
You shouldn't have to do anything special to store a hyperlink within your database, as it is simply a string. So you'll want to use a VARCHAR or TEXT field, and you may want to consider making it fairly large (e.g. VARCHAR(512) or VARCHAR(MAX)), as URLs "can" be quite long, although you may not run into any that big.
There is a better way.
Create a new int field and set it to identity so it auto-increments. Generally speaking, using strings as index keys is pretty bad: for one thing, if you want to change a URL later down the line for whatever reason, you are going to have to update all the foreign keys, which becomes horrific pretty quickly. If you have a gabillion URLs as well, your database size will balloon; a simple int field keeps the size down.
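A sketch of that layout (names here are made up for illustration) would be a surrogate integer key with the URL stored as plain data, and other tables referencing the small int rather than the long string:

```sql
-- Surrogate integer key; the URL itself is just data.
CREATE TABLE dbo.Urls (
    UrlId INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Url   VARCHAR(2048) NOT NULL
);

-- Hypothetical referencing table: foreign keys point at the
-- 4-byte UrlId, not at a 2KB string, keeping indexes small.
CREATE TABLE dbo.Visits (
    VisitId   INT IDENTITY(1,1) PRIMARY KEY,
    UrlId     INT NOT NULL REFERENCES dbo.Urls (UrlId),
    VisitedAt DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
);
```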
I sometimes have thought that I can use other fields as primary keys, but elect for the int field and boy am I glad I did that further down the line.
Unless I misunderstand the problem: how often are you expecting to insert a URL? You could well be underestimating the capability of your database. Databases can perform a lot of queries very quickly, so do some tests! There should be no reason why you can't just check for the URL with a quick SELECT statement before inserting it.
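The check-before-insert idea is a few lines of T-SQL (`@Url` is a parameter supplied by the caller; the locking hints are one way to avoid a race between the check and the insert under concurrency, assuming the simple table shape above):

```sql
BEGIN TRANSACTION;

-- UPDLOCK/HOLDLOCK keep two concurrent sessions from both passing
-- the existence check and then both inserting the same URL.
IF NOT EXISTS (
    SELECT 1 FROM dbo.Urls WITH (UPDLOCK, HOLDLOCK)
    WHERE Url = @Url
)
    INSERT INTO dbo.Urls (Url) VALUES (@Url);

COMMIT;
```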
Or you could insert at will then at a later date do a batch job to remove duplicates.
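The batch de-duplication approach could be a periodic job along these lines, keeping the earliest row per URL (assuming an `Id` identity column as in the sketches above):

```sql
-- Rank duplicate URLs by insertion order and delete all but the first.
WITH Ranked AS (
    SELECT Id,
           ROW_NUMBER() OVER (PARTITION BY Url ORDER BY Id) AS rn
    FROM dbo.Urls
)
DELETE FROM Ranked
WHERE rn > 1;
```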
Or you could queue them for inserting.
I would keep it simple. I think you might be surprised at how fast a database can be for basic queries, they were designed with that in mind.
In my mind your biggest problem is going to be how to store URLs, since the same resource can be written in many ways. For example, instead of storing the whole URL in one column, why not normalise it further: store domain suffixes (.com, .co.uk etc.) separately, and have a table linking domains with suffixes/prefixes/protocols. Also remember that http://www.example.com can be different from http://example.com in some edge cases.
If you do normalise to a higher level, then your constraints and uniques are all going to get quite a lot more complicated to manage.
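One possible (purely illustrative) decomposition, which also shows where the complexity creeps in:

```sql
CREATE TABLE dbo.Protocols (
    ProtocolId TINYINT IDENTITY(1,1) PRIMARY KEY,
    Scheme     VARCHAR(16) NOT NULL UNIQUE     -- e.g. 'http', 'https'
);

CREATE TABLE dbo.Domains (
    DomainId INT IDENTITY(1,1) PRIMARY KEY,
    Host     VARCHAR(255) NOT NULL UNIQUE      -- e.g. 'example.com'
);

CREATE TABLE dbo.Urls (
    UrlId        INT IDENTITY(1,1) PRIMARY KEY,
    ProtocolId   TINYINT NOT NULL REFERENCES dbo.Protocols (ProtocolId),
    DomainId     INT NOT NULL REFERENCES dbo.Domains (DomainId),
    PathAndQuery VARCHAR(2048) NOT NULL
    -- Uniqueness now spans three columns, and the long PathAndQuery
    -- still runs into the 900-byte index-key limit, so the hash or
    -- trigger approach from the question re-emerges here too.
);
```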
Lots to think about! Make sure you design it well.