 

Removing duplicate strings with limited memory [closed]

Say we have a list of strings and we can't load the entire list in to memory, but we can load parts of the list from file. What would be the best way to solve this?

Oscar asked Jun 23 '14 15:06


1 Answer

One approach would be to use an external sort to sort the file, and then remove the duplicates with a single pass over the sorted list (duplicates end up adjacent after sorting, so you only need to remember the previous string). This approach requires very little extra space and O(n log n) accesses to the disk.
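A minimal Python sketch of this external-sort approach (the `chunk_size` knob and function name are illustrative, not from the answer; `chunk_size` is the number of lines held in memory at once):

```python
import heapq
import itertools
import os
import tempfile

def external_sort_dedupe(in_path, out_path, chunk_size=100_000):
    """Sort a file of newline-terminated strings in limited memory,
    then write each distinct string once."""
    # Phase 1: read the input in chunks that fit in memory,
    # sort each chunk, and spill it to a temporary file.
    chunk_paths = []
    with open(in_path) as f:
        while True:
            chunk = list(itertools.islice(f, chunk_size))
            if not chunk:
                break
            chunk.sort()
            tmp = tempfile.NamedTemporaryFile("w", delete=False, suffix=".chunk")
            tmp.writelines(chunk)
            tmp.close()
            chunk_paths.append(tmp.name)

    # Phase 2: k-way merge of the sorted chunks. Duplicates arrive
    # adjacent in the merged stream, so comparing against the
    # previous line is enough to drop them.
    files = [open(p) for p in chunk_paths]
    prev = None
    with open(out_path, "w") as out:
        for line in heapq.merge(*files):
            if line != prev:
                out.write(line)
                prev = line
    for f in files:
        f.close()
    for p in chunk_paths:
        os.remove(p)
```

With `chunk_size` set to whatever fits in RAM, only one chunk (plus one buffered line per chunk during the merge) is ever in memory at a time.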

Another approach is based on hashing: compute a hashcode for each string, and load a sublist that contains all strings whose hashcode falls in a specific range. It is guaranteed that if x is loaded and it has a duplicate, the dupe will land in the same bucket as well.
This requires O(n*#buckets) accesses to the disk, but might require more memory. If a bucket is still too large to fit in memory, you can invoke the procedure recursively on it with a different hash function.
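The hashing approach can be sketched like this (again an assumption-laden sketch: `num_buckets` is a hypothetical knob chosen so each bucket fits in memory, and the single-level version is shown without the recursive fallback):

```python
import os
import tempfile

def hash_bucket_dedupe(in_path, out_path, num_buckets=16):
    """Partition strings into bucket files by hashcode, then dedupe
    each bucket in memory, one bucket at a time."""
    bucket_dir = tempfile.mkdtemp()
    buckets = [open(os.path.join(bucket_dir, "bucket%d" % i), "w")
               for i in range(num_buckets)]
    with open(in_path) as f:
        for line in f:
            # All copies of a string share a hashcode, so duplicates
            # are guaranteed to land in the same bucket file.
            buckets[hash(line) % num_buckets].write(line)
    for b in buckets:
        b.close()

    with open(out_path, "w") as out:
        for i in range(num_buckets):
            path = os.path.join(bucket_dir, "bucket%d" % i)
            with open(path) as b:
                seen = set()  # only one bucket is in memory at a time
                for line in b:
                    if line not in seen:
                        seen.add(line)
                        out.write(line)
            os.remove(path)
    os.rmdir(bucket_dir)
```

Unlike the external-sort version, the output order here depends on the bucket layout rather than on sort order, which is fine if you only care about uniqueness.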

amit answered Nov 07 '22 01:11