Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to generate a unique hash for a URL?

Tags:

algorithm

Given these two images from twitter.

http://a3.twimg.com/profile_images/130500759/lowres_profilepic.jpg
http://a1.twimg.com/profile_images/58079916/lowres_profilepic.jpg

I want to download them to local filesystem & store them in a single directory. How shall I overcome name conflicts ?

In the example above, I cannot store them as lowres_profilepic.jpg. My design idea is treat the URLs as opaque strings except for the last segment. What algorithms (implemented as f) can I use to hash the prefixes into unique strings.

f( "http://a3.twimg.com/profile_images/130500759/" ) = 6tgjsdjfjdhgf
f( "http://a1.twimg.com/profile_images/58079916/" )  = iuhd87ysdfhdk

That way, I can save the files as:-

6tgjsdjfjdhgf_lowres_profilepic.jpg
iuhd87ysdfhdk_lowres_profilepic.jpg

I don't want a cryptographic algorithm as it this needs to be a performant operation.

like image 974
Jacques René Mesrine Avatar asked Oct 27 '09 08:10

Jacques René Mesrine


1 Answers

Irrespective of the how you do it (hashing, encoding, database lookup) I recommend that you don't try to map a huge number of URLs to files in a big flat directory.

The reason is that file lookup for most file systems involves a linear scan through the filenames in a directory. So if all N of your files are in one directory, a lookup will involve 1/2 N comparisons on average; i.e. O(N) (Note that ReiserFS organizes the names in a directory as a BTree. However, ReiserFS seems to be the exception rather than the rule.)

Instead of one big flat directory, it would be better to map the URIs to a tree of directories. Depending on the shape of the tree, lookup can be as good as O(logN). For example, if you organized the tree so that it had 3 levels of directory with at most 100 entries in each directory, you could accommodate 1 million URLs. If you designed the mapping to use 2 character filenames, each directory should easily fit into a single disk block, and a pathname lookup (assuming that the required directories are already cached) should take a few microseconds.

like image 155
Stephen C Avatar answered Sep 19 '22 13:09

Stephen C