My problem is the following:
This is not an interview question or something similar. I need to do this for a business case.
Is there an algorithm that lets me achieve this goal (preferably in less than one month)? My first idea was a Bloom/Cuckoo filter, but I want to keep all of the URLs if possible.
I would implement a merge sort and eliminate the duplicates during the merge step.
You will want to stream the URLs in and create modest-sized batches that can be sorted in memory. Each of these sorted chunks is written to disk.
To merge these sorted chunks, stream in two (or more) of the files. Look at the next URL in each stream and take the smallest off of the stream, keeping track of the most recently output URL. As the next smallest URL is obtained, compare it to the most recently output URL - if it is a duplicate, skip it; otherwise output it (and remember it as the most recently output).
If your creation of sorted chunks gave you too many files to open at once, keep merging groups of files until you have one file. This result will have no duplicates.
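The two-stream merge step above can be sketched as follows. This is a minimal illustration, assuming one URL per line and lexicographic sort order; the class and method names (`MergeDedup`, `mergeDedup`) are my own, not from the original answer.

```java
import java.io.*;

// Merge two sorted line streams into one, skipping any line equal to the
// most recently written line (i.e., eliminating duplicates on the fly).
public class MergeDedup {
    public static void mergeDedup(BufferedReader a, BufferedReader b,
                                  BufferedWriter out) throws IOException {
        String la = a.readLine();
        String lb = b.readLine();
        String last = null; // most recently output URL
        while (la != null || lb != null) {
            String next;
            // Take the lexicographically smaller head of the two streams.
            if (lb == null || (la != null && la.compareTo(lb) <= 0)) {
                next = la;
                la = a.readLine();
            } else {
                next = lb;
                lb = b.readLine();
            }
            if (!next.equals(last)) { // skip duplicates
                out.write(next);
                out.newLine();
                last = next;
            }
        }
    }
}
```

Because both inputs are sorted, all copies of a URL arrive consecutively, so comparing against only the last output line is enough to remove every duplicate.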
You probably would use Arrays.parallelSort() to do the in-memory sorting of your initial chunks. You probably would benefit from removing duplicates from these initially sorted chunks while outputting the sorted array elements.
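A sketch of that chunk-sorting step, under the same one-URL-per-line assumption: sort the batch with `Arrays.parallelSort()` and drop adjacent duplicates while producing the output. In practice you would write the result to a temporary file with buffered I/O; returning a list here just keeps the sketch self-contained. The helper name `sortedDedupedChunk` is illustrative.

```java
import java.util.*;

// Sort one in-memory batch of URLs and remove duplicates while emitting
// the sorted elements, as suggested above.
public class ChunkSorter {
    public static List<String> sortedDedupedChunk(List<String> urls) {
        String[] arr = urls.toArray(new String[0]);
        Arrays.parallelSort(arr); // parallel in-memory sort of the batch
        List<String> out = new ArrayList<>();
        String last = null;
        for (String u : arr) {
            if (!u.equals(last)) { // duplicates are adjacent after sorting
                out.add(u);
                last = u;
            }
        }
        return out;
    }
}
```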
Definitely use buffered I/O.
When merging multiple files, I would create a priority queue that holds the next record from each stream along with the stream it came from. You grab the next item from the priority queue, read the next line from the stream that item came from, and put that new line in the queue. The number of streams you can merge at once will be limited either by the number of files you can have open or by the memory required for buffered I/O across all the streams.
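The priority-queue merge can be sketched like this, again assuming sorted one-URL-per-line inputs; `KWayMerge` and `Entry` are illustrative names.

```java
import java.io.*;
import java.util.*;

// K-way merge: a priority queue holds the next line from each stream
// together with the stream it came from. After an entry is taken, the
// next line from that same stream is read and offered back.
public class KWayMerge {
    // One queue entry: a line plus the reader it came from.
    static class Entry {
        final String line;
        final BufferedReader source;
        Entry(String line, BufferedReader source) {
            this.line = line;
            this.source = source;
        }
    }

    public static void merge(List<BufferedReader> streams,
                             BufferedWriter out) throws IOException {
        PriorityQueue<Entry> pq =
            new PriorityQueue<>(Comparator.comparing((Entry e) -> e.line));
        for (BufferedReader r : streams) {
            String first = r.readLine();
            if (first != null) pq.add(new Entry(first, r));
        }
        String last = null;
        while (!pq.isEmpty()) {
            Entry e = pq.poll(); // globally smallest pending line
            if (!e.line.equals(last)) { // drop duplicates across streams
                out.write(e.line);
                out.newLine();
                last = e.line;
            }
            String next = e.source.readLine();
            if (next != null) pq.add(new Entry(next, e.source));
        }
    }
}
```

Each `poll`/`add` pair costs O(log k) for k streams, so merging n total lines is O(n log k) regardless of how many files you fan in.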
To implement this probably requires a page or so of code - it is pretty straightforward to run on one machine. However, if it fits your infrastructure, this problem is a good fit for a Hadoop cluster or similar. If you want to run it fast on, e.g., AWS, you would probably want to use a queue (e.g., SQS on AWS) to manage the chunks to be sorted/merged - it becomes more involved, but will run much faster.