Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Marking dynamic substrings in a list of strings

Assume these two sets of strings:

file=sheet-2016-12-08.xlsx
file=sheet-2016-11-21.xlsx
file=sheet-2016-11-12.xlsx
file=sheet-2016-11-08.xlsx
file=sheet-2016-10-22.xlsx
file=sheet-2016-09-29.xlsx
file=sheet-2016-09-05.xlsx
file=sheet-2016-09-04.xlsx

size=1024KB
size=22KB
size=980KB
size=15019KB
size=202KB

I need to run a function on both these sets separately and receive the following output respectively:

file=sheet-2016-*.xlsx

size=*KB

The dataset can be any set of strings. It doesn't have to match the format. Here's another example for instance:

id.4030.paid
id.1280.paid
id.88.paid

For which the expected output would be:

id.*.paid

Basically, I need a function to analyze a set of strings and replace the uncommon substring with an asterisk(*)

like image 478
HyderA Avatar asked Aug 25 '17 22:08

HyderA


2 Answers

you could use os.path.commonprefix to compute the common prefix. It is used to compute shared directories in a list of filepaths, but it can be used in a generic context.

Then reverse the strings, and apply common prefix again, then reverse, to compute common suffix (adapted from https://gist.github.com/willwest/ca5d050fdf15232a9e67)

dataset = """id.4030.paid
id.1280.paid
id.88.paid""".splitlines()

import os


# Return the longest common suffix in a list of strings
def longest_common_suffix(list_of_strings):
    reversed_strings = [s[::-1] for s in list_of_strings]
    return os.path.commonprefix(reversed_strings)[::-1]

common_prefix = os.path.commonprefix(dataset)
common_suffix = longest_common_suffix(dataset)

print("{}*{}".format(common_prefix,common_suffix))

result:

id.*.paid

EDIT: as wim noted:

  • when all strings are equal, common prefixes & suffixes are the same, but it should return the string itself instead of prefix*suffix: should check if all strings are the same
  • when common prefix & suffixes overlap/have shared letters, this confuses the computation as well: should compute common suffix on the string minus the common prefix

So a all-in-one method is required to test the list beforehand to make sure that at least 2 strings are different (condensing the prefix/suffix formula in the process), and compute common suffix with slicing to remove common prefix:

def compute_generic_string(dataset):
    # edge case where all strings are the same
    if len(set(dataset))==1:
        return dataset[0]

    commonprefix = os.path.commonprefix(dataset)

    return "{}*{}".format(commonprefix,os.path.commonprefix([s[len(commonprefix):][::-1] for s in dataset])[::-1])

now let's test this:

for dataset in [['id.4030.paid','id.1280.paid','id.88.paid'],['aBBc', 'aBc'],[]]:
    print(compute_generic_string(dataset))

result:

id.*.paid
aB*c
*

(when dataset is empty, code returns *, maybe that should be another edge case)

like image 71
Jean-François Fabre Avatar answered Oct 19 '22 09:10

Jean-François Fabre


from os.path import commonprefix

def commonsuffix(m):
    return commonprefix([s[::-1] for s in m])[::-1]

def inverse_glob(strs):
    start = commonprefix(strs)
    n = len(start)
    ends = [s[n:] for s in strs]
    end = commonsuffix(ends)
    if start and not any(ends):
        return start
    else:
        return start + '*' + end

This problem is trickier than it seems at face value.

As currently specified, the question remains poorly constrained, i.e. there is not a unique solution. For input ['spamAndEggs', 'spamAndHamAndEggs'], both spam*AndEggs and spamAnd*Eggs are valid answers. For input ['aXXXXz', 'aXXXz'] there are four possible solutions. In the code given above, we prefer choosing the longest possible prefix in order to make the solution unique.

Credit to JFF's answer for pointing out the existence of os.path.commonprefix.

Inverse glob - reverse engineer a wildcard string from file names is a related and more difficult generalisation of this question.

like image 40
wim Avatar answered Oct 19 '22 07:10

wim