Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

best way to store url in mysql for a read&write-intensive application

What is the best way to store url in mysql effectively for a read&write-intensive application?

I will be storing over 500,000 web addresses (all starting with either http:// or https://. no other protocols) and saving the whole url (http://example.com/path/?variable=a) into one column seems to be largely redundant because the same domain name and path will be saved to mysql multiple times.

So, initially, I thought of breaking them down (i.e. domain, path, and variables, etc) to get rid of redundancy. But I saw some posts saying that it's not recommended. Any idea on this?

Also, the application often has to retrieve urls without primary keys, meaning it has to search text to retrieve url. URL can be indexed, but I'm wondering how much performance difference there would be between storing the whole url and broken-down-url if they are all indexed under innodb(no full text indexing).

Broken-down-url will have to go through extra steps of combining them. Also, it would mean that I have to retrieve data 4 times from different tables(protocol, domain, path, variable), but it also makes the stored data in each row shorter and there would be less rows in each table. Would this possibly speed up the process?

like image 983
ap3 Avatar asked Sep 01 '25 02:09

ap3


1 Answers

I have dealt with this extensively, and my general philosophy is to use the frequency of use method. It's cumbersome, but it lets you run some great analytics on the data:

CREATE TABLE URL (
   ID            integer unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
   DomainPath    integer unsigned NOT NULL,
   QueryString   text
) Engine=MyISAM;

CREATE TABLE DomainPath (   
   ID            integer unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
   Domain        integer unsigned NOT NULL,
   Path          text,
   UNIQUE (Domain,Path)
) Engine=MyISAM;

CREATE TABLE Domain (   
   ID            integer unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
   Protocol      tinyint NOT NULL,
   Domain        varchar(64)
   Port          smallint NULL,
   UNIQUE (Protocol,Domain,Port)
) Engine=MyISAM;

As a general rule, you'll have similar Paths on a single Domain, but different QueryStrings for each path.

I originally designed this to have all parts indexed in a single table (Protocol, Domain, Path, Query String) but think the above is less space-intensive and lends better to getting better data out of it.

text tends to be slow, so you can change "Path" to a varchar after some use. Most servers die after about 1K for a URL, but I've seen some large ones and would err on the side of not losing data.

Your retrieval query is cumbersome, but if you abstract it away in your code, no issue:

SELECT CONCAT(
    IF(D.Protocol=0,'http://','https://'),
    D.Domain,
    IF(D.Port IS NULL,'',CONCAT(':',D.Port)), 
    '/', DP.Path, 
    IF(U.QueryString IS NULL,'',CONCAT('?',U.QueryString))
)
FROM URL U
INNER JOIN DomainPath DP ON U.DomainPath=DP.ID
INNER JOIN Domain D on DP.Domain=D.ID
WHERE U.ID=$DesiredID;

Store a port number if it's non standard (non-80 for http, non-443 for https), otherwise store it as NULL to signify it shouldn't be included. (You can add the logic to the MySQL but it gets much uglier.)

I would always (or never) strip the "/" from the Path as well as the "?" from the QueryString for space savings. Only loss would being able to distinguish between

http://www.example.com/
http://www.example.com/?

Which, if important, then I would change your tack to never strip it and just include it. Technically,

http://www.example.com 
http://www.example.com/

Are the same, so stripping the Path slash is OK always.

So, to parse:

http://www.example.com/my/path/to/my/file.php?id=412&crsource=google+adwords

We would use something like parse_url in PHP to produce:

array(
    [scheme] => 'http',
    [host] => 'www.example.com',
    [path] => '/my/path/to/my/file.php',
    [query] => 'id=412&crsource=google+adwords',
)

You would then check/insert (with appropriate locks, not shown):

SELECT D.ID FROM Domain D 
WHERE 
    D.Protocol=0 
    AND D.Domain='www.example.com' 
    AND D.Port IS NULL

(if doesn't exist)

INSERT INTO Domain ( 
    Protocol, Domain, Port 
) VALUES ( 
    0, 'www.example.com', NULL 
);

We then have our $DomainID going forward...

Then insert into DomainPath:

SELECT DP.ID FORM DomainPath DP WHERE 
DP.Domain=$DomainID AND Path='/my/path/to/my/file.php';

(if it doesn't exist, insert it similarly)

We then have our $DomainPathID going forward...

SELECT U.ID FROM URL 
WHERE 
    DomainPath=$DomainPathID 
    AND QueryString='id=412&crsource=google+adwords'

and insert if necessary.

Now, let me note importantly, that the above scheme will be slow for high-performance sites. You should modify everything to use a hash of some sort to speed up SELECTs. In short, the technique is like:

CREATE TABLE Foo (
     ID integer unsigned PRIMARY KEY NOT NULL AUTO_INCREMENT,
     Hash varbinary(16) NOT NULL,
     Content text
) Type=MyISAM;

SELECT ID FROM Foo WHERE Hash=UNHEX(MD5('id=412&crsource=google+adwords'));

I deliberately eliminated it from the above to keep it simple, but comparing a TEXT to another TEXT for selects is slow, and breaks for really long query strings. Don't use a fixed-length index either because that will also break. For arbitrary length strings where accuracy matters, a hash failure rate is acceptable.

Finally, if you can, do the MD5 hash client side to save sending large blobs to the server to do the MD5 operation. Most modern languages supports MD5 built-in:

SELECT ID FROM Foo WHERE Hash=UNHEX('82fd4bcf8b686cffe81e937c43b5bfeb');

But I digress.

like image 166
razzed Avatar answered Sep 02 '25 16:09

razzed