I was asked this question recently.
Given a continuous stream of words, remove the duplicates while reading the input.
Example:
Input: This is next stream of question see it is a question
Output: This next stream of see it is a question
Starting from end, question
as well as is
already appeared once, so the second time it's ignored.
My solution:
Use hashing in this scenario for each word coming through stream.
If there is a collision then then ignore that word.
It's definitely not a good solution. I was asked to optimize it.
What is the best approach to solve this problem?
Hashing isn't a particularly bad solution.
It gives expected O(wordLength)
lookup time, but O(wordLength * wordCount)
in the worst case, and uses O(maxWordLength * wordCount)
space.
Alternatives:
A trie is a tree data structure where each edge corresponds to a letter and the path from the root defines the value of the node.
This will give O(wordLength)
lookup time and uses O(wordCount * maxWordLength)
space, although the actual space usage may be lower as repeated prefixes (e.g. te
in the below example) only use space once.
A binary search tree is a tree data structure where each node in the subtree rooted at the left child is smaller than its parent, and similarly all nodes to the right are greater.
A self-balancing one gives O(wordLength * log wordCount)
lookup time and uses O(wordCount * maxWordLength)
space.
A bloom filter is a data structure consisting of some number of bits and a few hash functions which maps a word to a bit, sets the output of each hash function on add and checks if any are not set on query.
This uses less space than the above solutions, but at the cost of false positives - some words will be marked as duplicates that aren't.
Specifically, it uses 1.44 log2(1/e)
bits per key, where e
is the false positive rate, giving O(wordCount)
space usage, but with an incredibly low constant factor.
This will give O(wordLength)
lookup time.
An example of a Bloom filter, representing the set
{x, y, z}
. The colored arrows show the positions in the bit array that each set element is mapped to. The elementw
is not in the set{x, y, z}
, because it hashes to one bit-array position containing 0. For this figure,m=18
andk=3
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With