Find two lines in a file that are the same

I was asked this question in an Amazon interview.

You have a file with many lines, but two of the lines are the same. Find those two lines. I gave the obvious answer that runs in N^2 time. I then came up with an answer that used a hash table, but they didn't like that answer either, because they said it wouldn't work if the file were gigabytes in size. Another answer I came up with was, instead of storing the hash results in memory, to create a file named after each hash value and store the lines with the same hash value in that file. Either they couldn't understand my solution or they didn't like it.

Any thoughts?

Thanks

Asked Dec 06 '12 by John Smith

1 Answer

I can think of two essential classes of solutions to this problem:

  1. Probabilistic in-memory solutions. You can try to solve this problem by storing a summary of the lines of the file in main memory. You can then do computations in main memory to identify possible duplicates, then check each possible duplicate by looking back at the disk. These solutions are probably the best, since they have low memory usage, high efficiency, and minimize disk accesses. Solutions in this category include:

    1. Compute a hash of each line of the file, then store the hashes. Any lines whose hashes collide form a candidate pair of duplicates, and only those candidates need to be compared directly (see the first sketch after this list).
    2. Use a Bloom filter to store all the lines of the file, then check only pairs that collide in the Bloom filter. This is essentially a more space-efficient variation of (1) (second sketch below).
  2. Deterministic on-disk solutions. You can try to do computations with the entire data set on disk, using main memory as temporary scratch space. This lets you get exact answers without having to hold the whole file in memory, but it will probably be slower unless the restructured data pays off in later processing. Solutions in this category include:

    1. Use an external sorting algorithm (external merge sort, external quicksort, external radix sort, etc.) to sort the file; duplicates then end up adjacent, so a single linear scan finds them (third sketch below).
    2. Build an on-disk data structure like a B-tree holding all of the strings, then query the B-tree. This takes a lot of preprocessing time, but makes future operations on the file a lot faster. (The database sketch below is a rough approximation, since SQLite stores its indexes as B-trees.)
    3. Put everything in a database and query the database (final sketch below).
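
As a concrete illustration, here is a minimal sketch of option 1.1 in Python. The digest size and function names are my own choices, not part of the question; the key idea is that only an 8-byte digest per line is kept in memory, and any collision is verified by re-reading the two lines from disk.

```python
import hashlib

def find_duplicate_by_hash(path):
    # Pass 1: map each line's 8-byte digest to the byte offset of its first
    # occurrence. Memory is proportional to the number of lines, not their size.
    seen = {}          # digest -> offset of the first line with that digest
    candidates = []    # (first_offset, second_offset) pairs whose hashes collide
    with open(path, "rb") as f:
        offset = 0
        for line in f:
            digest = hashlib.blake2b(line, digest_size=8).digest()
            if digest in seen:
                candidates.append((seen[digest], offset))
            else:
                seen[digest] = offset
            offset += len(line)

    # Pass 2: a hash collision is only a *possible* duplicate, so re-read
    # both lines from disk and compare them byte for byte.
    with open(path, "rb") as f:
        for first, second in candidates:
            f.seek(first)
            a = f.readline()
            f.seek(second)
            b = f.readline()
            if a == b:
                return a
    return None
```

With a decent 64-bit hash, false collisions are rare even for hundreds of millions of lines, so pass 2 usually checks just the one real duplicate.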
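
Option 1.2 can be sketched the same way. This version builds a small Bloom filter from scratch on top of hashlib; the bit-array size and the number of hash functions are arbitrary tuning parameters, and it assumes the set of candidate (mostly false-positive) lines flagged in the first pass stays small enough to hold in memory.

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits=2**29, k=4):   # 2^29 bits = 64 MiB
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from 8-byte slices of one SHA-256 digest.
        digest = hashlib.sha256(item).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[i * 8:(i + 1) * 8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def find_duplicate_by_bloom(path):
    bloom, candidates = BloomFilter(), set()
    # Pass 1: any line the filter claims to have seen before is a candidate.
    with open(path, "rb") as f:
        for line in f:
            if bloom.might_contain(line):
                candidates.add(line)
            bloom.add(line)
    # Pass 2: verify candidates exactly; the first repeat is the answer.
    seen = set()
    with open(path, "rb") as f:
        for line in f:
            if line in candidates:
                if line in seen:
                    return line
                seen.add(line)
    return None
```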
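
For option 2.1, a rough external merge sort: sort fixed-size chunks of lines in memory, spill each sorted run to a temporary file, then k-way merge the runs with heapq.merge. After sorting, any duplicates are adjacent, so one linear pass over the merged stream finds them. The chunk size is an assumed tuning knob.

```python
import heapq
import tempfile

def find_duplicate_external_sort(path, chunk_lines=1_000_000):
    runs = []
    # Phase 1: produce sorted runs on disk, chunk_lines lines at a time.
    with open(path, "rb") as f:
        while True:
            # Normalize a possibly unterminated final line so that runs
            # stay newline-delimited after sorting.
            chunk = [ln if ln.endswith(b"\n") else ln + b"\n"
                     for _, ln in zip(range(chunk_lines), f)]
            if not chunk:
                break
            chunk.sort()
            run = tempfile.TemporaryFile()
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)

    # Phase 2: merge the sorted runs; equal lines arrive back to back.
    prev = None
    for line in heapq.merge(*runs):
        if line == prev:
            return line
        prev = line
    return None
```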
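
Finally, option 2.3 with the standard-library sqlite3 module; the database file, table, and column names are placeholders. Since SQLite stores its tables and indexes as B-trees, this also approximates option 2.2 without hand-writing one.

```python
import sqlite3

def find_duplicate_sql(path, db_path="lines.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS lines (content BLOB)")
    with open(path, "rb") as f:
        con.executemany("INSERT INTO lines VALUES (?)",
                        ((line,) for line in f))
    con.commit()
    # The index makes the GROUP BY below a pass over already-sorted data.
    con.execute("CREATE INDEX IF NOT EXISTS idx_content ON lines(content)")
    row = con.execute(
        "SELECT content FROM lines GROUP BY content HAVING COUNT(*) > 1"
    ).fetchone()
    con.close()
    return row[0] if row else None
```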

Hope this helps!

Answered Nov 10 '22 by templatetypedef