What is the best way to store url in mysql effectively for a read&write-intensive application?
I will be storing over 500,000 web addresses (all starting with either http:// or https://. no other protocols) and saving the whole url (http://example.com/path/?variable=a) into one column seems to be largely redundant because the same domain name and path will be saved to mysql multiple times.
So, initially, I thought of breaking them down (i.e. domain, path, and variables, etc) to get rid of redundancy. But I saw some posts saying that it's not recommended. Any idea on this?
Also, the application often has to retrieve urls without primary keys, meaning it has to search text to retrieve url. URL can be indexed, but I'm wondering how much performance difference there would be between storing the whole url and broken-down-url if they are all indexed under innodb(no full text indexing).
Broken-down-url will have to go through extra steps of combining them. Also, it would mean that I have to retrieve data 4 times from different tables(protocol, domain, path, variable), but it also makes the stored data in each row shorter and there would be less rows in each table. Would this possibly speed up the process?
I have dealt with this extensively, and my general philosophy is to use the frequency of use method. It's cumbersome, but it lets you run some great analytics on the data:
CREATE TABLE URL (
ID integer unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
DomainPath integer unsigned NOT NULL,
QueryString text
) Engine=MyISAM;
CREATE TABLE DomainPath (
ID integer unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
Domain integer unsigned NOT NULL,
Path text,
UNIQUE (Domain,Path)
) Engine=MyISAM;
CREATE TABLE Domain (
ID integer unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
Protocol tinyint NOT NULL,
Domain varchar(64)
Port smallint NULL,
UNIQUE (Protocol,Domain,Port)
) Engine=MyISAM;
As a general rule, you'll have similar Paths on a single Domain, but different QueryStrings for each path.
I originally designed this to have all parts indexed in a single table (Protocol, Domain, Path, Query String) but think the above is less space-intensive and lends better to getting better data out of it.
text
tends to be slow, so you can change "Path" to a varchar after some use. Most servers die after about 1K for a URL, but I've seen some large ones and would err on the side of not losing data.
Your retrieval query is cumbersome, but if you abstract it away in your code, no issue:
SELECT CONCAT(
IF(D.Protocol=0,'http://','https://'),
D.Domain,
IF(D.Port IS NULL,'',CONCAT(':',D.Port)),
'/', DP.Path,
IF(U.QueryString IS NULL,'',CONCAT('?',U.QueryString))
)
FROM URL U
INNER JOIN DomainPath DP ON U.DomainPath=DP.ID
INNER JOIN Domain D on DP.Domain=D.ID
WHERE U.ID=$DesiredID;
Store a port number if it's non standard (non-80 for http, non-443 for https), otherwise store it as NULL to signify it shouldn't be included. (You can add the logic to the MySQL but it gets much uglier.)
I would always (or never) strip the "/" from the Path as well as the "?" from the QueryString for space savings. Only loss would being able to distinguish between
http://www.example.com/
http://www.example.com/?
Which, if important, then I would change your tack to never strip it and just include it. Technically,
http://www.example.com
http://www.example.com/
Are the same, so stripping the Path slash is OK always.
So, to parse:
http://www.example.com/my/path/to/my/file.php?id=412&crsource=google+adwords
We would use something like parse_url
in PHP to produce:
array(
[scheme] => 'http',
[host] => 'www.example.com',
[path] => '/my/path/to/my/file.php',
[query] => 'id=412&crsource=google+adwords',
)
You would then check/insert (with appropriate locks, not shown):
SELECT D.ID FROM Domain D
WHERE
D.Protocol=0
AND D.Domain='www.example.com'
AND D.Port IS NULL
(if doesn't exist)
INSERT INTO Domain (
Protocol, Domain, Port
) VALUES (
0, 'www.example.com', NULL
);
We then have our $DomainID
going forward...
Then insert into DomainPath:
SELECT DP.ID FORM DomainPath DP WHERE
DP.Domain=$DomainID AND Path='/my/path/to/my/file.php';
(if it doesn't exist, insert it similarly)
We then have our $DomainPathID
going forward...
SELECT U.ID FROM URL
WHERE
DomainPath=$DomainPathID
AND QueryString='id=412&crsource=google+adwords'
and insert if necessary.
Now, let me note importantly, that the above scheme will be slow for high-performance sites. You should modify everything to use a hash of some sort to speed up SELECT
s. In short, the technique is like:
CREATE TABLE Foo (
ID integer unsigned PRIMARY KEY NOT NULL AUTO_INCREMENT,
Hash varbinary(16) NOT NULL,
Content text
) Type=MyISAM;
SELECT ID FROM Foo WHERE Hash=UNHEX(MD5('id=412&crsource=google+adwords'));
I deliberately eliminated it from the above to keep it simple, but comparing a TEXT to another TEXT for selects is slow, and breaks for really long query strings. Don't use a fixed-length index either because that will also break. For arbitrary length strings where accuracy matters, a hash failure rate is acceptable.
Finally, if you can, do the MD5 hash client side to save sending large blobs to the server to do the MD5 operation. Most modern languages supports MD5 built-in:
SELECT ID FROM Foo WHERE Hash=UNHEX('82fd4bcf8b686cffe81e937c43b5bfeb');
But I digress.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With