Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I uniquely shorten a list of strings so that they are at most x characters long

I'm looking for an algorithm that will take a vector of strings v1 and return a similar vector of strings v2 where each string is less than x characters long and is unique. The strings in v1 may not be unique.

While I need to accept ASCII in v1, I'd prefer to only insert alphanumeric characters ([A-Za-z0-9]) when insertion of new characters is required.

Obviously there are three caveats here:

  1. For some values of v1 and x, there is no possible unique v2. For example, when v1 has 37 elements and x == 1.

  2. "Similar" as specified in the question is subjective. The strings will be user facing, and presumably short natural language phrases (e.g. "number of colours"). I want a human to be able to map the original to the shortened string as easily as possible. This probably means taking advantage of heuristics such as disemvoweling. Because there is probably no objective measure of my similarness construct (string distance probably won't be the most useful here, although it might) my judgement on what is good will be arbitrary. The method should be suitable for English - other languages are irrelevant.

Obviously this is a (programming) language-agnostic problem, but I'd look favourably on an implementation in python (because I find its string processing language straight-forward).

like image 603
fmark Avatar asked Apr 02 '12 06:04

fmark


2 Answers

a few notes / pointers about doing this in python.

  1. Use the bisect module to keep your result array in order to easily spot potential non-uniques. This is helpful even if v1 is already sorted (e.g. name and enemy will collide after disenvoweling)
  2. Disemvoweling can be achieved by simple calling .translate(None, "aeiouyAEIOUY") on the string.
  3. In case of duplicates you could try to resolve the collisions first by lowercasing all results and using swapcase as "bitmask", i.e. multiple occurences of aaa become ["aaa", "aaA", "aAa", "aAA"] etc. and if this is not enough "incrementing" characters starting from the end, until a non-colliding identifier is found, eg. ["aa"]*7 would become ["aa", "aA", "Aa", "AA", "ab", "aB", "Ab"]
like image 96
Kimvais Avatar answered Nov 15 '22 07:11

Kimvais


Sketch -

Develop a list of functions that reduce the size of an english string. Order the functions from least to most obscuring.

For each string in v1 repeatedly apply an obscuring function until it can no longer reduce the size of the string and then move on to the next function.

When desired size x is achieved, verify reduced string is unique with respect to strings already in v2. If so, add it to v2, if not, continue to apply obscuring functions.

Following are some ideas for size reducing functions subjectively ordered from least to most obscuring. (The random selections are intended to increase the probability of the reduced string being unique.)

  1. Replace a random occurrence of two white spaces characters with a single space
  2. Replace a random occurrence of punctuation followed by space with a single space
  3. Remove a single character word at random that is also a member of a kill list (eg "I", "a")
  4. Remove a two character word at random that is also a member of a kill list (eg "an", "of")
  5. Remove a three character word at random that is also a member of a kill list (eg "the", "and")
  6. Replace a five or more character word with word composed of first three and last character (eg "number" becomes "numr", "colours" becomes "colrs")
  7. Remove a vowel at random
  8. Remove a word that occurs in a large number of strings in v1. The idea being that very common words have low value.
  9. Translate a word/phrase to a shorter "vanity license plate" word based on a dictionary(thesaurus) ( such as http://www.baac.net/michael/plates/index.html )

(Note: the last two functions would require access to the initial unaltered string, and the correspondences between the unaltered and altered words.)

like image 26
Brian Swift Avatar answered Nov 15 '22 06:11

Brian Swift