
How to calculate a hash for a string (url) in bash for wget caching

Tags: bash, md5, wget

I'm building a little tool that will download files using wget, reading the urls from different files. The same url may be present in different files; it may even be present in one file several times. It would be inefficient to download a page several times (every time its url is found in the list(s)).

Thus, the simple approach is to save the downloaded file and to instruct wget not to download it again if it is already there.

That would be very straightforward; however the urls are very long (many many GET parameters) and therefore cannot be used as such for filenames (wget gives the error 'Cannot write to... [] file name too long').

So, I need to rename the downloaded files. But for the caching mechanism to work, the renaming scheme needs to implement "one url <=> one name": if a given url can have multiple names, the caching does not work (i.e., if I simply number the files in the order they are found, wget has no way to identify which urls have already been downloaded).

The simplest renaming scheme would be to calculate an md5 hash of the filename (and not of the file itself, which is what md5sum does); that would ensure the filename is unique and that a given url always results in the same name.

It's possible to do this in Perl, etc., but can it be done directly in bash or using a system utility (RedHat)?

Bambax asked Oct 21 '09 17:10



3 Answers

Sounds like you want the md5sum system utility.

URLMD5=`/bin/echo "$URL" | /usr/bin/md5sum | /bin/cut -f1 -d" "`

If you only want to hash the filename part of the url (everything after the last slash), you can extract it with sed first:

FILENAME=`echo "$URL" | /bin/sed -e 's#.*/##'`
URLMD5=`/bin/echo "$FILENAME" | /usr/bin/md5sum | /bin/cut -f1 -d" "`

Note that, depending on your distribution, the path to cut may be /usr/bin/cut.
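To connect this back to the caching goal in the question, here is a minimal sketch, not a definitive implementation. The function names url_to_name and fetch_cached and the CACHE_DIR variable are hypothetical, chosen only for illustration. It hashes the whole url with printf '%s' (so no trailing newline is included in the hash) and skips the download when a non-empty file with that name already exists.

```shell
# Hypothetical helper: map a url to a fixed-length, filesystem-safe name.
url_to_name() {
    # printf '%s' writes the string as-is, without a trailing newline
    printf '%s' "$1" | md5sum | cut -d' ' -f1
}

# Hypothetical helper: download a url only if it is not cached yet.
# CACHE_DIR is an assumed variable; it defaults to ./cache here.
fetch_cached() {
    local url=$1
    local dir=${CACHE_DIR:-./cache}
    local target
    target="$dir/$(url_to_name "$url")"

    mkdir -p "$dir" || return 1

    # A non-empty file with this name means the url was already fetched.
    if [ -s "$target" ]; then
        return 0
    fi

    # Download to a temporary name first so that a failed transfer
    # does not leave a truncated file that looks like a cache hit.
    if wget -q -O "$target.part" "$url"; then
        mv "$target.part" "$target"
    else
        rm -f "$target.part"
        return 1
    fi
}

# Example use, assuming a file urls.txt with one url per line:
#   while IFS= read -r url; do fetch_cached "$url"; done < urls.txt
```

Because the name depends only on the url string, the same url found in several lists maps to the same file and is downloaded once.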

Epsilon Prime answered Nov 06 '22 12:11



Other options on my Ubuntu (Precise) box:

  • echo -n "$STRING" | sha512sum
  • echo -n "$STRING" | sha256sum
  • echo -n "$STRING" | sha224sum
  • echo -n "$STRING" | sha384sum
  • echo -n "$STRING" | sha1sum
  • echo -n "$STRING" | shasum

Other options on my Mac:

  • echo -n "$STRING" | shasum -a 512
  • echo -n "$STRING" | shasum -a 256
  • etc.
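
If the same script has to run on both kinds of machine, one possible approach is to pick whichever tool is installed. This is only a sketch: the function name hash_string is hypothetical, and SHA-256 is an arbitrary choice among the options listed above.

```shell
# Hypothetical helper: print the SHA-256 hex digest of a string,
# using sha256sum where available and shasum -a 256 otherwise.
hash_string() {
    if command -v sha256sum >/dev/null 2>&1; then
        printf '%s' "$1" | sha256sum | cut -d' ' -f1
    elif command -v shasum >/dev/null 2>&1; then
        printf '%s' "$1" | shasum -a 256 | cut -d' ' -f1
    else
        echo "no SHA-256 tool found" >&2
        return 1
    fi
}
```

Both tools print the digest followed by a file name column, so cut keeps only the first field.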
kdauria answered Nov 06 '22 12:11



I don't have the rep to comment on the answer, but there's one clarification to Epsilon Prime's answer: by default, echo prints a newline at the end of the text, and that newline becomes part of the hashed input. If you want the md5 sums to match what any other tool generates (e.g. PHP, Java's MD5, etc.), you need to call

echo -n "$url"

which will suppress the newline.
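
A small sketch of the difference follows. The function names are hypothetical, and printf is used in place of echo -n because it behaves the same way across shells. The first function hashes the string plus a trailing newline (what plain echo produces); the second hashes the string alone.

```shell
# Hypothetical helper: hash the string followed by a newline,
# which is what piping plain `echo "$1"` into md5sum does.
md5_with_newline() {
    printf '%s\n' "$1" | md5sum | cut -d' ' -f1
}

# Hypothetical helper: hash exactly the bytes of the string,
# equivalent to `echo -n "$1"` in bash.
md5_plain() {
    printf '%s' "$1" | md5sum | cut -d' ' -f1
}
```

For the input abc, md5_plain prints 900150983cd24fb0d6963f7d28e17f72, the standard MD5 test value for that string, while md5_with_newline prints a different digest.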

user1043466 answered Nov 06 '22 11:11
