I was asked this question in an Amazon interview.
You have a file with many lines, and exactly two of those lines are identical. Find those two lines. I gave the obvious answer that runs in N^2 time. I then came up with an answer that used a hash table, but they didn't like that either, because they said it wouldn't work if the file was gigabytes in size. My next idea was, instead of keeping the hash table in memory, to use each hash value as a filename on disk and append every line with that hash value to the corresponding file. Either they couldn't understand my solution or they didn't like it.
Any thoughts?
Thanks
I can think of two essential classes of solutions to this problem:
Probabilistic in-memory solutions. You can store a compact summary of the file's lines in main memory, use that summary to identify possible duplicates, and then confirm each candidate by looking back at the file on disk. These solutions are probably the best, since they have low memory usage, high efficiency, and minimize disk accesses. Solutions in this category include hashing each line to a short fingerprint and keeping only the fingerprints in memory, or using a Bloom filter as the summary.
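As a sketch of the first class (my own illustration, not code from the answer): hash each line to a small fingerprint, remember only the fingerprint and a byte offset, and verify every fingerprint collision against the actual file so that a hash collision between two different lines cannot produce a false positive. The function name `find_duplicate_line` is made up for this example.

```python
import hashlib

def find_duplicate_line(path):
    """Locate a duplicated line while holding only 8-byte
    fingerprints (plus file offsets) in memory."""
    seen = {}          # fingerprint -> byte offset of first occurrence
    candidates = []    # (offset_a, offset_b) pairs to verify on disk
    with open(path, "rb") as f:
        offset = 0
        for line in f:
            fp = hashlib.blake2b(line, digest_size=8).digest()
            if fp in seen:
                candidates.append((seen[fp], offset))
            else:
                seen[fp] = offset
            offset += len(line)

    # A fingerprint match might be a hash collision, so confirm by
    # re-reading both lines from disk before reporting a duplicate.
    with open(path, "rb") as f:
        for a, b in candidates:
            f.seek(a)
            line_a = f.readline()
            f.seek(b)
            if line_a == f.readline():
                return line_a
    return None
```

With 8-byte fingerprints the memory cost is a few dozen bytes per line rather than the line itself, which is what makes this workable on a multi-gigabyte file with short lines.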
Deterministic on-disk solutions. You can do the computation over the entire data set on disk, using main memory only as temporary scratch space. This gives an exact answer without holding the whole file in memory, but it would probably be slower, unless you were doing some later processing and could benefit from restructuring the data. Solutions in this category include an external merge sort of the lines, after which any duplicate lines end up adjacent to each other.
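A sketch of the second class (again my own illustration, assuming an external merge sort): sort the file in fixed-size chunks written to temporary run files, stream a merge of the sorted runs, and report the first pair of equal adjacent lines. Memory use is bounded by the chunk size; `find_duplicate_external` and `lines_per_chunk` are names invented for this example.

```python
import heapq
import os
import tempfile

def find_duplicate_external(path, lines_per_chunk=100_000):
    """External merge sort: sort limited-size chunks of the file into
    temporary run files, then stream-merge the runs and look for two
    equal adjacent lines."""
    run_paths = []
    with open(path, "rb") as f:
        while True:
            # Read at most lines_per_chunk lines into memory.
            chunk = [line for _, line in zip(range(lines_per_chunk), f)]
            if not chunk:
                break
            chunk.sort()
            run = tempfile.NamedTemporaryFile(mode="wb", delete=False)
            run.writelines(chunk)
            run.close()
            run_paths.append(run.name)

    # Merge the sorted runs; any duplicates become adjacent lines.
    runs = [open(p, "rb") for p in run_paths]
    try:
        prev = None
        for line in heapq.merge(*runs):
            if line == prev:
                return line
            prev = line
        return None
    finally:
        for r in runs:
            r.close()
        for p in run_paths:
            os.unlink(p)
```

This is essentially the classic two-pass external sort: each run fits in memory, and `heapq.merge` keeps only one buffered line per run during the merge phase.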
Hope this helps!