I'm looking for an algorithm that will take a vector of strings v1
and return a similar vector of strings v2
where each string is less than x
characters long and is unique. The strings in v1
may not be unique.
While I need to accept ASCII in v1
, I'd prefer to only insert alphanumeric characters ([A-Za-z0-9]
) when insertion of new characters is required.
Obviously there are three caveats here:
For some values of v1
and x
, there is no possible unique v2
. For example, when v1
has 37 elements and x == 1
.
"Similar" as specified in the question is subjective. The strings will be user facing, and presumably short natural language phrases (e.g. "number of colours"). I want a human to be able to map the original to the shortened string as easily as possible. This probably means taking advantage of heuristics such as disemvoweling. Because there is probably no objective measure of my similarness construct (string distance probably won't be the most useful here, although it might) my judgement on what is good will be arbitrary. The method should be suitable for English - other languages are irrelevant.
Obviously this is a (programming) language-agnostic problem, but I'd look favourably on an implementation in python (because I find its string processing language straight-forward).
a few notes / pointers about doing this in python.
v1
is already sorted (e.g. name
and enemy
will collide after disenvoweling).translate(None, "aeiouyAEIOUY")
on the string.["aaa", "aaA", "aAa", "aAA"]
etc. and if this is not enough "incrementing" characters starting from the end, until a non-colliding identifier is found, eg. ["aa"]*7
would become ["aa", "aA", "Aa", "AA", "ab", "aB", "Ab"]
Sketch -
Develop a list of functions that reduce the size of an english string. Order the functions from least to most obscuring.
For each string in v1
repeatedly apply an obscuring function until it can no longer reduce the size of the string and then move on to the next function.
When desired size x
is achieved, verify reduced string is unique with respect to strings already in v2
. If so, add it to v2
, if not, continue to apply obscuring functions.
Following are some ideas for size reducing functions subjectively ordered from least to most obscuring. (The random selections are intended to increase the probability of the reduced string being unique.)
(Note: the last two functions would require access to the initial unaltered string, and the correspondences between the unaltered and altered words.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With