
Find duplicate strings in a large file

A file contains a large number (e.g. 10 billion) of strings and you need to find the duplicates. You have N systems available. How will you find the duplicates?

Tushar Gupta asked Oct 09 '10 18:10




2 Answers

erickson's answer is probably the one expected by whoever set this question.

You could use each of the N machines as a bucket in a hashtable:

  • For each string (say string number i in the sequence), compute a hash value h.
  • Send the values of i and h to machine number n for storage, where n = h % N.
  • From each machine, retrieve a list of all hash values h for which more than one index was received, together with the lists of indexes.
  • Compare the sets of strings with equal hash values to see whether they are actually equal.
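The steps above can be sketched in Python, with the N machines simulated as in-process dictionaries; in a real deployment the dictionary writes would be network sends to the machine that owns each bucket:

```python
import hashlib
from collections import defaultdict

def find_duplicates(strings, n_machines):
    # Each "machine" is a dict mapping hash value -> list of string indexes.
    machines = [defaultdict(list) for _ in range(n_machines)]

    # Steps 1-2: hash each string and route (i, h) to machine h % N.
    for i, s in enumerate(strings):
        h = int(hashlib.sha256(s.encode()).hexdigest(), 16)
        machines[h % n_machines][h].append(i)

    # Steps 3-4: for every hash value that received more than one index,
    # compare the actual strings to rule out hash collisions.
    duplicates = set()
    for buckets in machines:
        for indexes in buckets.values():
            if len(indexes) > 1:
                seen = set()
                for i in indexes:
                    if strings[i] in seen:
                        duplicates.add(strings[i])
                    seen.add(strings[i])
    return duplicates
```

The final string comparison matters because two different strings can share a hash value; the partitioning by h % N only guarantees that all copies of a string land on the same machine.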

To be honest, though, for 10 billion strings you could plausibly do this on one PC. The hashtable might occupy something like 80-120 GB with a 32-bit hash, depending on the exact hashtable implementation. If you're looking for an efficient solution, you need to be more specific about what you mean by "machine", because it depends on how much storage each one has and on the relative cost of network communication.
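The 80-120 GB figure is a back-of-envelope estimate; the per-entry sizes below are assumptions about a compact table layout, not a measurement of any particular implementation:

```python
n_strings = 10_000_000_000

# Assumed layouts: a 4-byte hash plus a 4-byte index, tightly packed,
# versus the same entry with ~4 bytes of per-slot table overhead.
bytes_per_entry_low = 8
bytes_per_entry_high = 12

low_gb = n_strings * bytes_per_entry_low / 1e9    # 80 GB
high_gb = n_strings * bytes_per_entry_high / 1e9  # 120 GB
```

Note that 10 billion strings also exceed what a 32-bit hash can distinguish, so collisions are guaranteed and the final string comparison step is unavoidable.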

Steve Jessop answered Oct 05 '22 10:10


Split the file into N pieces, one per machine. On each machine, load as much of its piece into memory as you can at a time, sort the strings, and write each sorted chunk to mass storage on that machine. Then, on each machine, merge its chunks into a single sorted stream, and finally merge the streams from all machines into one stream containing every string in sorted order. Compare each string with the previous one; if they are equal, it is a duplicate.

erickson answered Oct 05 '22 10:10