 

Given a continuous stream of words, remove the duplicates

Tags:

algorithm

I was asked this question recently.

Given a continuous stream of words, remove the duplicates while reading the input.

Example:

Input: This is next stream of question see it is a question

Output: This next stream of see it is a question

Scanning from the end, question and is have each already appeared once, so their repeated (earlier) occurrences are ignored.

My solution:

  1. Hash each word as it comes through the stream.

  2. If the word is already present in the hash table, ignore it.

It's definitely not a good solution. I was asked to optimize it.

What is the best approach to solve this problem?

Arvind asked Dec 19 '22 15:12

1 Answer

Hashing isn't a particularly bad solution.

It gives expected O(wordLength) lookup time, but O(wordLength * wordCount) in the worst case, and uses O(maxWordLength * wordCount) space.
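The hashing approach can be sketched with a hash set; this is a minimal illustration (function and variable names are my own), and note it keeps the first occurrence of each word:

```python
def dedupe_stream(words):
    """Yield each word the first time it is seen, skipping duplicates."""
    seen = set()                # expected O(1) membership check per word
    for word in words:
        if word not in seen:
            seen.add(word)
            yield word

print(list(dedupe_stream("a b a c b".split())))  # → ['a', 'b', 'c']
```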

Alternatives:

Trie

A trie is a tree data structure where each edge corresponds to a letter and the path from the root defines the value of the node.

This will give O(wordLength) lookup time and uses O(wordCount * maxWordLength) space, although the actual space usage may be lower, as repeated prefixes (e.g. the shared te in "ten" and "tea") only use space once.
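A minimal trie sketch for duplicate detection (class and method names are illustrative); add returns True the first time a word is seen and False afterwards:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # one edge per letter
        self.is_word = False # marks the end of an inserted word

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word):
        """Insert word; return True if new, False if a duplicate."""
        node = self.root
        for ch in word:                              # O(wordLength) walk
            node = node.children.setdefault(ch, TrieNode())
        if node.is_word:
            return False                             # seen before
        node.is_word = True
        return True

trie = Trie()
print(trie.add("ten"))   # True  (first occurrence)
print(trie.add("tea"))   # True  (shares the "te" prefix nodes)
print(trie.add("ten"))   # False (duplicate)
```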

Binary search tree

A binary search tree is a tree data structure in which, for every node, all values in the left subtree are smaller than the node's value and all values in the right subtree are greater.

A self-balancing one gives O(wordLength * log wordCount) lookup time and uses O(wordCount * maxWordLength) space.
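A sketch of the idea with a plain (unbalanced) BST; a production version would use a self-balancing variant such as a red-black tree to guarantee the log wordCount bound. Names here are illustrative:

```python
class BSTNode:
    def __init__(self, word):
        self.word = word
        self.left = None
        self.right = None

def bst_add(root, word):
    """Insert word; return (root, True) if new, (root, False) if duplicate."""
    if root is None:
        return BSTNode(word), True
    node = root
    while True:
        if word == node.word:
            return root, False                  # duplicate found
        branch = "left" if word < node.word else "right"
        child = getattr(node, branch)
        if child is None:
            setattr(node, branch, BSTNode(word))
            return root, True
        node = child

root = None
root, fresh = bst_add(root, "stream")
print(fresh)                                    # True
root, fresh = bst_add(root, "stream")
print(fresh)                                    # False
```

Each comparison between two words costs up to O(wordLength), and a balanced tree makes O(log wordCount) comparisons per lookup, which is where the O(wordLength * log wordCount) bound comes from.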

Bloom filter

A Bloom filter is a data structure consisting of a bit array and a few hash functions, each of which maps a word to a bit position. On add, the bit at each hash function's output is set; on query, the word is reported as seen only if all of those bits are set.

This uses less space than the above solutions, but at the cost of false positives - some words will be marked as duplicates that aren't.

Specifically, it uses 1.44 log2(1/e) bits per key, where e is the false positive rate, giving O(wordCount) space usage, but with an incredibly low constant factor.

This will give O(wordLength) lookup time.

[Figure: an example of a Bloom filter representing the set {x, y, z}. The colored arrows show the positions in the bit array that each set element is mapped to. The element w is not in the set, because it hashes to one bit-array position containing 0. Here m = 18 and k = 3.]
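A minimal Bloom filter sketch (m bits, k hash functions); deriving the k positions by salting Python's built-in hash() is an assumption for illustration only, since a real implementation would use independent, stable hash functions:

```python
class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m = m                    # number of bits
        self.k = k                    # number of hash functions
        self.bits = [False] * m

    def _positions(self, word):
        # k positions from one salted hash (illustrative shortcut).
        return [hash((word, i)) % self.m for i in range(self.k)]

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos] = True

    def might_contain(self, word):
        # False means definitely not seen; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(word))

bf = BloomFilter()
print(bf.might_contain("question"))  # False: definitely new
bf.add("question")
print(bf.might_contain("question"))  # True: seen (or false positive)
```

Because a "True" answer may be a false positive, this approach can occasionally drop a word that was not actually a duplicate, which is the trade-off described above.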

Bernhard Barker answered Jan 25 '23 23:01