 

Given 10 billion URLs with an average length of 100 characters each, check for duplicates

Suppose I have 1 GB of memory available. How can I find the duplicates among those URLs?

I saw a solution in the book "Cracking the Coding Interview". It suggests using a hash function to split these URLs into 4000 files x.txt, where x = hash(u) % 4000, in a first scan. In a second scan, we can check for duplicates within each x.txt file separately.
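For concreteness, here is a minimal sketch of that approach in Java (my own illustration, not code from the book; the input file urls.txt and the bucket file names are assumptions):

```java
import java.io.*;
import java.util.HashSet;

public class PartitionUrls {
    static final int BUCKETS = 4000;

    public static void main(String[] args) throws IOException {
        // Pass 1: scatter each URL into one of 4000 bucket files by hash.
        BufferedWriter[] out = new BufferedWriter[BUCKETS];
        for (int i = 0; i < BUCKETS; i++) {
            out[i] = new BufferedWriter(new FileWriter("bucket_" + i + ".txt"));
        }
        try (BufferedReader in = new BufferedReader(new FileReader("urls.txt"))) {
            String url;
            while ((url = in.readLine()) != null) {
                // floorMod keeps the index non-negative even when hashCode() is negative
                int bucket = Math.floorMod(url.hashCode(), BUCKETS);
                out[bucket].write(url);
                out[bucket].newLine();
            }
        }
        for (BufferedWriter w : out) w.close();

        // Pass 2: every copy of a given URL landed in the same bucket, so each
        // bucket can be deduplicated independently with an in-memory set.
        for (int i = 0; i < BUCKETS; i++) {
            HashSet<String> seen = new HashSet<>();
            try (BufferedReader in = new BufferedReader(new FileReader("bucket_" + i + ".txt"))) {
                String url;
                while ((url = in.readLine()) != null) {
                    if (!seen.add(url)) {
                        System.out.println("duplicate: " + url);
                    }
                }
            }
        }
    }
}
```

Note that keeping 4000 files open at once can exceed the default open-file limit on some systems, so in practice you might raise the limit or write the buckets in batches.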

But how can I guarantee that each file stores no more than about 1 GB of URL data? I think there's a chance that some files end up with much more URL data than others.

My solution to this problem is to apply the file-splitting trick recursively until the files are small enough for the memory available to me.

Is there any other way to do it?

asked Jul 29 '17 by Tony L

2 Answers

If you don't mind a solution which requires a bit more code, you can do the following:

  1. Calculate only the hashcodes. Each hashcode is exactly 4 bytes, so you have perfect control over the amount of memory occupied by each chunk of hashcodes. You can also fit a lot more hashcodes than URLs in memory, so you will have far fewer chunks.

  2. Find the duplicate hashcodes. Presumably, there will be far fewer of them than 10 billion; they might even all fit in memory.

  3. Go through the URLs again, recomputing hashcodes, checking whether each URL has one of the duplicate hashcodes, and then comparing the actual URLs to rule out false positives due to hashcode collisions. (With 10 billion URLs and only about 4 billion distinct hashcode values, there will be plenty of collisions.) A rough sketch of these three steps follows the list.
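Here is one way the mechanics could look in Java. This is my interpretation, not necessarily what the answer intends: instead of writing hashcode chunks to disk, it marks the 32-bit hashcode space with two bit arrays of 512 MB each (so the pair just about fills the 1 GB budget; run with a slightly larger heap, e.g. -Xmx1200m), and the final exact string comparison is only outlined:

```java
import java.io.*;

public class FindDuplicateHashcodes {
    public static void main(String[] args) throws IOException {
        // Steps 1 and 2: one bit per possible 32-bit hashcode value.
        // Each array covers 2^32 bits = 512 MB.
        long[] seenOnce  = new long[1 << 26];
        long[] seenTwice = new long[1 << 26];

        try (BufferedReader in = new BufferedReader(new FileReader("urls.txt"))) {
            String url;
            while ((url = in.readLine()) != null) {
                long h = url.hashCode() & 0xFFFFFFFFL;   // hashcode as an unsigned 32-bit value
                int word = (int) (h >>> 6);
                long bit = 1L << (h & 63);
                if ((seenOnce[word] & bit) != 0) {
                    seenTwice[word] |= bit;              // this hashcode occurs at least twice
                } else {
                    seenOnce[word] |= bit;
                }
            }
        }

        // Step 3: second scan. URLs whose hashcode occurred more than once are
        // only candidates; they still have to be compared as full strings to
        // rule out hashcode collisions. Here they are just written out for a
        // later exact-match pass (for example, sort candidates.txt and look
        // for equal neighbours).
        try (BufferedReader in = new BufferedReader(new FileReader("urls.txt"));
             BufferedWriter out = new BufferedWriter(new FileWriter("candidates.txt"))) {
            String url;
            while ((url = in.readLine()) != null) {
                long h = url.hashCode() & 0xFFFFFFFFL;
                if ((seenTwice[(int) (h >>> 6)] & (1L << (h & 63))) != 0) {
                    out.write(url);
                    out.newLine();
                }
            }
        }
    }
}
```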

answered Oct 31 '22 by Mike Nakis


This is a bit long for a comment.

The truth is, you cannot guarantee that a file is going to be smaller than 1 GB. I'm not sure where the 4,000 comes from, but the total data volume is about 1,000 GB (10 billion URLs × 100 characters), so the average file size would be about 250 MB.

It is highly unlikely that any one file would be off from that average by a factor of four in size. Of course, it is possible. In that case, just split the offending file again into a handful of smaller files. This adds a negligible amount of complexity.

What this doesn't account for is a simple edge case. What if one of the URLs has a length of 100 and appears 10,000,000 times in the data? That single URL alone contributes about 1 GB to its file. Ouch! In that case, you would need to read the file and "reduce" it by combining each distinct value with a count.
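A minimal sketch of that reduce step over one bucket file, in Java (the bucket file name is just a placeholder):

```java
import java.io.*;
import java.util.HashMap;
import java.util.Map;

public class ReduceBucket {
    public static void main(String[] args) throws IOException {
        // Collapse repeated URLs in one bucket into (url, count) pairs.
        // A URL repeated 10,000,000 times then costs one map entry
        // instead of roughly 1 GB of raw text.
        Map<String, Long> counts = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("bucket_42.txt"))) {
            String url;
            while ((url = in.readLine()) != null) {
                counts.merge(url, 1L, Long::sum);
            }
        }
        try (BufferedWriter out = new BufferedWriter(new FileWriter("bucket_42.reduced.txt"))) {
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                if (e.getValue() > 1) {
                    System.out.println("duplicate: " + e.getKey() + " x" + e.getValue());
                }
                out.write(e.getKey() + "\t" + e.getValue());
                out.newLine();
            }
        }
    }
}
```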

answered Oct 31 '22 by Gordon Linoff