I need to store user agent strings in a database for tracking and comparing customer behavior and sales performance between different browsers. A pretty plain user agent string is around 100 characters long. It was decided to use a varchar(1024)
for holding the useragent data in the database. (I know this is overkill, but that's the idea; it's supposed to accommodate useragent data for years to come and some devices, toolbars, applications are already pushing 500 characters in length.) The table holding these strings will be normalized (each distinct user agent string will only be stored once) and treated like a cache so we don't have to interpret user agents over and over again.
The typical use case is:
Note: I have a tendency to say 'searching' for the user agent string in the database because it's not a simple look up. But to be clear, the queries are going to use '=' operators, not regexes or LIKE % syntax.
So the speed of looking up the user agent string is paramount. I've explored a few methods of making sure it will have good performance. Indexing the whole column is right out for size reasons. A partial index isn't such a good idea either because most user agents have the distinguishing information at the end; the partial index would have to be fairly long to make it worthwhile by which point its size is causing problems.
So it comes down to a hash function. My thought is to hash the user agent string in web server code and run the select looking for the hash value in the database. I feel like this would minimize the load on the database server (as opposed to having it compute the hash), especially since if the hash isn't found, the code would turn around and ask the database to compute the hash again on the insert.
Hashing to an integer value would offer the best performance at the risk of higher collisions. I'm expecting to see thousands or tens of thousands user agents at the most; even 100,000 user agents would fit reasonably well into a 2^32 size integer with very few collisions which could be deciphered by the webservice with minimal performance impact. Even if you think an integer hash isn't such a good idea, using a 32 character digest (SHA-1, MD5 e.g.) should be much faster for selects than the raw string, right?
My database is MySQL InnoDB engine. The web code will be coming from C# at first and php later (after we consolidate some hosting and authentication) (not that the web code should make a big difference).
Let me apologize at this point if you think this is lame choose-my-hash-algorithm question. I'm really hoping to get some input from people who've done something similar before and their decision process. So, the question:
Your idea of hashing long strings to create a token upon which to lookup within a store (cache, or database) is a good one. I have seen this done for extremely large strings, and within high volume environments, and it works great.
"Which hash would you use for this application?"
"Would you compute the hash in code or let the db handle it?"
"Is there a radically different approach for storing/searching long strings in a database?"
Table recommendations (demonstrative only):
user
user_agent_history
user_id
int(11) unsigned not nullagent_hash
varchar(255) not nullagent
agent_hash
varchar(255) not nullbrowser
varchar(100) not nullagent
text not nullFew notes on schema:
From your OP it sounds like you need a M:M relationship between user and agent, due to the fact that a user may be using Firefox from work, but then may switch to IE9 at home. Hence the need for the pivot table.
The varchar(255) used for agent_hash
is up for debate. MySQL suggests using a varbinary column type for storing hashes, of which there are several types.
I would also suggest either making agent_hash
a primary key, or at the very least, adding a UNIQUE constraint to the column.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With