
Find duplicate strings in a large file

A file contains a large number (e.g. 10 billion) of strings and you need to find the duplicates. You have N systems available. How will you find the duplicates?

Tushar Gupta asked Oct 09 '10 18:10




2 Answers

erickson's answer is probably the one expected by whoever set this question.

You could use each of the N machines as a bucket in a hashtable:

  • For each string (say string number i in the sequence), compute a hash value h.
  • Send the values of i and h to machine number n for storage, where n = h % N.
  • From each machine, retrieve a list of all hash values h for which more than one index was received, together with the lists of indexes.
  • Compare the sets of strings with equal hash values to see whether they are actually equal.
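The steps above can be sketched in Python, with the N machines simulated as in-process dictionaries; in a real deployment the dictionary writes would be network sends to the machine that owns each bucket:

```python
import hashlib
from collections import defaultdict

def find_duplicates(strings, n_machines):
    # Each "machine" is a dict mapping hash value -> list of string indexes.
    machines = [defaultdict(list) for _ in range(n_machines)]

    # Steps 1-2: hash each string and route (i, h) to machine h % N.
    for i, s in enumerate(strings):
        h = int(hashlib.sha256(s.encode()).hexdigest(), 16)
        machines[h % n_machines][h].append(i)

    # Steps 3-4: for every hash value that received more than one index,
    # compare the actual strings to rule out hash collisions.
    duplicates = set()
    for buckets in machines:
        for indexes in buckets.values():
            if len(indexes) > 1:
                seen = set()
                for i in indexes:
                    if strings[i] in seen:
                        duplicates.add(strings[i])
                    seen.add(strings[i])
    return duplicates
```

The final string comparison matters because two different strings can share a hash value; the partitioning by h % N only guarantees that all copies of a string land on the same machine.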

To be honest, though, for 10 billion strings you could plausibly do this on one PC. The hashtable might occupy something like 80-120 GB with a 32-bit hash, depending on the exact hashtable implementation. If you're looking for an efficient solution, you need to be more specific about what you mean by "machine", because it depends on how much storage each one has and on the relative cost of network communication.
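The 80-120 GB figure is a back-of-envelope estimate; the per-entry sizes below are assumptions about a compact table layout, not a measurement of any particular implementation:

```python
n_strings = 10_000_000_000

# Assumed layouts: a 4-byte hash plus a 4-byte index, tightly packed,
# versus the same entry with ~4 bytes of per-slot table overhead.
bytes_per_entry_low = 8
bytes_per_entry_high = 12

low_gb = n_strings * bytes_per_entry_low / 1e9    # 80 GB
high_gb = n_strings * bytes_per_entry_high / 1e9  # 120 GB
```

Note that 10 billion strings also exceed what a 32-bit hash can distinguish, so collisions are guaranteed and the final string comparison step is unavoidable.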

Steve Jessop answered Oct 05 '22 10:10


Split the file into N pieces, one per machine. On each machine, load as much of its piece into memory as you can at a time, sort the strings, and write each sorted chunk to mass storage on that machine. Then, on each machine, merge its chunks into a single sorted stream, and finally merge the streams from all machines into one stream containing every string in sorted order. Compare each string with the previous one; if they are equal, it is a duplicate.

erickson answered Oct 05 '22 10:10