
Calculating unique URLs in a huge dataset (150+ billion)

Tags:

java

bigdata

My problem is the following:

  • I get a list of URLs consisting of roughly 150 billion entries.
  • Each month I get a new batch of another ~150 billion entries.
  • I should remove the duplicates and store the rest.
  • This should be done on a single machine that is quite small relative to the task (around 32-64 GB of RAM available, plus a fair amount of disk space).
  • I should store the unique urls (the storage problem is already solved).
  • My chosen language for this task is Java.

This is not an interview question or something similar. I need to do this for a business case.

Is there an algorithm that would let me achieve this goal (preferably in less than a month)? My first idea was a Bloom/Cuckoo filter, but I want to keep all of the URLs if possible.

asked Dec 22 '25 by Lakatos Gyula

1 Answer

I would implement a merge sort and eliminate the duplicates during the merge step.

You will want to stream the URLs in and create modest-sized batches that can be sorted in memory. Each of these sorted chunks is then written to disk.

To merge these sorted chunks, stream in two (or more) of the files. Look at the next URL in each stream and take the smallest off of the stream, keeping track of the most recently output URL. As the next smallest URL is obtained, compare it to the most recently output URL - if it is a duplicate, skip it; otherwise output it (and remember it as the most recently output).

If your creation of sorted chunks gave you too many files to open at once, keep merging groups of files until you have one file. This result will have no duplicates.

You probably would use Arrays.parallelSort() to do the in-memory sorting of your initial chunks. You probably would benefit from removing duplicates from these initially sorted chunks while outputting the sorted array elements.

Definitely use buffered I/O.
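To make that first pass concrete, here is a minimal sketch along those lines - the batch size, output file names, and the assumption of one UTF-8 URL per input line are mine, not part of the original answer:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Sketch of the first pass: read URLs, sort each batch in memory with
// Arrays.parallelSort(), drop duplicates within the batch while writing,
// and emit one sorted chunk file per batch. Batch size and file names are
// illustrative assumptions.
public class ChunkSorter {

    static final int BATCH_SIZE = 5_000_000; // tune to the available heap

    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            String[] batch = new String[BATCH_SIZE];
            int count = 0, chunkNo = 0;
            String url;
            while ((url = in.readLine()) != null) {
                batch[count++] = url;
                if (count == BATCH_SIZE) {
                    writeSortedChunk(batch, count, chunkNo++);
                    count = 0;
                }
            }
            if (count > 0) {
                writeSortedChunk(batch, count, chunkNo);
            }
        }
    }

    static void writeSortedChunk(String[] batch, int count, int chunkNo) throws IOException {
        Arrays.parallelSort(batch, 0, count);
        try (BufferedWriter out = Files.newBufferedWriter(
                Paths.get("sorted-" + chunkNo + ".txt"), StandardCharsets.UTF_8)) {
            String previous = null;
            for (int i = 0; i < count; i++) {
                if (!batch[i].equals(previous)) { // skip duplicates inside the chunk
                    out.write(batch[i]);
                    out.newLine();
                    previous = batch[i];
                }
            }
        }
    }
}
```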

When merging multiple files, I would create a priority queue that has the next record from each stream along with which stream it comes from. You grab the next item from the priority queue, read the next line from the stream that next item came from, and put that new line in the queue. The number of streams you can merge from will be limited either by the number of files you can have open or the memory required for buffered I/O from all the streams.
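A minimal sketch of that priority-queue merge might look like the following - the ChunkMerger/Entry names and the file handling are illustrative assumptions (records need Java 16+); the dedup-during-merge logic is the point:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the priority-queue merge: each queue entry holds the next URL
// from one sorted chunk plus the reader it came from; the most recently
// written URL is remembered so duplicates across chunks are skipped.
public class ChunkMerger {

    record Entry(String url, BufferedReader source) {}

    public static void merge(List<Path> sortedChunks, Path output) throws IOException {
        PriorityQueue<Entry> queue = new PriorityQueue<>(Comparator.comparing(Entry::url));

        // Seed the queue with the first URL of every chunk.
        for (Path chunk : sortedChunks) {
            BufferedReader reader = Files.newBufferedReader(chunk, StandardCharsets.UTF_8);
            String first = reader.readLine();
            if (first != null) {
                queue.add(new Entry(first, reader));
            } else {
                reader.close();
            }
        }

        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            String lastWritten = null;
            while (!queue.isEmpty()) {
                Entry smallest = queue.poll();
                if (!smallest.url().equals(lastWritten)) { // drop cross-chunk duplicates
                    out.write(smallest.url());
                    out.newLine();
                    lastWritten = smallest.url();
                }
                // Refill the queue from the stream the smallest entry came from.
                String next = smallest.source().readLine();
                if (next != null) {
                    queue.add(new Entry(next, smallest.source()));
                } else {
                    smallest.source().close();
                }
            }
        }
    }
}
```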

Implementing this probably requires a page or so of code - it is pretty straightforward to run on one machine. However, if it fits your infrastructure, this problem is a good fit for a Hadoop cluster or something similar. If you want to run this fast on, e.g., AWS, you would probably want to use a queue (e.g., SQS) to manage the chunks to be sorted/merged - it becomes more involved, but will run much faster.

Other Considerations

  1. Fault tolerance: This process will take a long time to run; if it fails in the middle, do you want to have to start over from the beginning? You might want to start with a step that just streams through the source file and breaks it into appropriately-sized, unsorted chunks. This step in itself will take a while as there is a fair bit of I/O involved. Put these chunks into a directory called "unsorted", for example. Then have a second process that repeatedly looks in the "unsorted" directory, picks a chunk, reads it, sorts it, writes it to the "sorted" directory and moves the unsorted chunk from "unsorted" to "archived" (a rough sketch of this worker loop appears after this list). Then have a third process that reads chunks from "sorted", merges them (removing duplicates) and writes to "sorted1" or "final" (depending on whether it is merging all remaining files, or not). The idea is to structure things so that you are always making forward progress, so if your server dies, you can pick up where you left off.
  2. Parallelism: This process will take a long time. The more servers that you can apply to it in parallel, the faster it will go. You can achieve this (if you have servers available) in much the same way as fault tolerance - you can do the sorting and (intermediate) merging steps on many machines in parallel (with appropriate file locking or some other scheme so they don't try to work on the same chunks).
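As referenced in the fault-tolerance point above, a rough sketch of the sorting worker's loop could look like this - the directory layout follows the answer, but the code itself and the omission of locking between parallel workers are my assumptions:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

// Sketch of the "unsorted" -> "sorted" -> "archived" worker loop. Locking
// to keep several parallel workers off the same chunk is left out for brevity.
public class SortWorker {

    public static void run(Path unsortedDir, Path sortedDir, Path archivedDir) throws IOException {
        while (true) {
            Path chunk;
            try (Stream<Path> files = Files.list(unsortedDir)) {
                chunk = files.findFirst().orElse(null);
            }
            if (chunk == null) {
                break; // nothing left; all progress so far is already on disk
            }

            List<String> urls = Files.readAllLines(chunk, StandardCharsets.UTF_8);
            String[] sorted = urls.toArray(new String[0]);
            Arrays.parallelSort(sorted);

            Path target = sortedDir.resolve(chunk.getFileName());
            try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
                String previous = null;
                for (String url : sorted) {
                    if (!url.equals(previous)) { // drop duplicates within the chunk
                        out.write(url);
                        out.newLine();
                        previous = url;
                    }
                }
            }

            // Archive the input only after the sorted copy is safely written,
            // so a crash mid-sort just means the chunk is re-sorted on restart.
            Files.move(chunk, archivedDir.resolve(chunk.getFileName()),
                       StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```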
answered Dec 23 '25 by Rob


