Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

An efficient algorithm for finding smallest pangrammatic windows?

A pangrammatic window is a substring of a larger piece of text that contains all 26 letters of the alphabet. To quote an example from Wikipedia, given this text:

I sang, and thought I sang very well; but he just looked up into my face with a very quizzical expression, and said, 'How long have you been singing, Mademoiselle?'

The smallest pangrammatic window in the text is this string:

g very well; but he just looked up into my face with a very quizzical ex

Which indeed contains every letter at least once.

My question is this: Given a text corpus, what is the most efficient algorithm for finding the smallest pangrammatic window in the text?

I've given this some thought and come up with the following algorithms already. I have a strong feeling that these are not optimal, but I thought I'd post them as a starting point.

There is a simple naive algorithm that runs in time O(n2) and space O(1): For each position in the string, scan forward from that position and track what letters you've seen (perhaps in a bit vector, which, since there are only 26 different letters, takes space O(1)). Once you've found all 26 letters, you have the length of the shortest pangrammatic window starting at that given point. Each scan might take time O(n), and there are O(n) scans, for a grand total of O(n2) time.

We can also solve this problem in time O(n log n) and space O(n) using a modified binary search. Construct 26 arrays, one for each letter of the alphabet, then populate those arrays with the positions of each letter in the input text in sorted order. We can do this by simply scanning across the text, appending each index to the array corresponding to the current character. Once we have this, we can find, in time O(log n), the length of the shortest pangrammatic window beginning at some index by running 26 binary searches in the arrays to find the earliest time that each character appears in the input array at or after the given index. Whichever of these numbers is greatest gives the "long pole" character that appears furthest down in the string, and thus gives the endpoint of the pangrammatic window. Running this search step takes O(log n) time, and since we have to do it for all n characters in the string, the total runtime is O(n log n), with O(n) memory usage for the arrays.

A further refinement for the above approach is to replace the arrays and binary search with van Emde Boas trees and predecessor searches. This increases the creation time to O(n log log n), but reduces each search time to O(log log n) time, for a net runtime of O(n log log n) with O(n) space usage.


Are there any better algorithms out there?

like image 803
templatetypedef Avatar asked Mar 19 '12 17:03

templatetypedef


2 Answers

For every letter keep track of the recent-most sighting. Whenever you process a letter, update the corresponding sighting index and calculate the range (max-min) of sighting indexes over all letters. Find the location with minimum range.

Complexity O(n). O(nlog(m)) if you consider alphabet size m.

like image 134
ElKamina Avatar answered Sep 23 '22 07:09

ElKamina


This algorithm has O(M) space complexity and O(N) time complexity (time does not depend on alphabet size M):

  1. Advance first iterator and increase counter for each processed letter. Stop when all 26 counters are non-zero.
  2. Advance second iterator and decrease counter for each processed letter. Stop when any of these counters is zero.
  3. Use difference between iterators to update best-so-far result and continue with step 1.

This algorithm may be improved a little bit if instead of character counters, positions in the string are stored. In this case step 2 should only read these positions and compare with current position, and step 1 should update these positions and (most of the time) search for some character in the text.

like image 43
Evgeny Kluev Avatar answered Sep 21 '22 07:09

Evgeny Kluev