Assume these two sets of strings:
file=sheet-2016-12-08.xlsx
file=sheet-2016-11-21.xlsx
file=sheet-2016-11-12.xlsx
file=sheet-2016-11-08.xlsx
file=sheet-2016-10-22.xlsx
file=sheet-2016-09-29.xlsx
file=sheet-2016-09-05.xlsx
file=sheet-2016-09-04.xlsx
size=1024KB
size=22KB
size=980KB
size=15019KB
size=202KB
I need to run a function on both these sets separately and receive the following output respectively:
file=sheet-2016-*.xlsx
size=*KB
The dataset can be any set of strings. It doesn't have to match the format. Here's another example for instance:
id.4030.paid
id.1280.paid
id.88.paid
For which the expected output would be:
id.*.paid
Basically, I need a function to analyze a set of strings and replace the uncommon substring with an asterisk(*)
you could use os.path.commonprefix
to compute the common prefix. It is used to compute shared directories in a list of filepaths, but it can be used in a generic context.
Then reverse the strings, and apply common prefix again, then reverse, to compute common suffix (adapted from https://gist.github.com/willwest/ca5d050fdf15232a9e67)
dataset = """id.4030.paid
id.1280.paid
id.88.paid""".splitlines()
import os
# Return the longest common suffix in a list of strings
def longest_common_suffix(list_of_strings):
reversed_strings = [s[::-1] for s in list_of_strings]
return os.path.commonprefix(reversed_strings)[::-1]
common_prefix = os.path.commonprefix(dataset)
common_suffix = longest_common_suffix(dataset)
print("{}*{}".format(common_prefix,common_suffix))
result:
id.*.paid
EDIT: as wim noted:
prefix*suffix
: should check if all strings are the sameSo a all-in-one method is required to test the list beforehand to make sure that at least 2 strings are different (condensing the prefix/suffix formula in the process), and compute common suffix with slicing to remove common prefix:
def compute_generic_string(dataset):
# edge case where all strings are the same
if len(set(dataset))==1:
return dataset[0]
commonprefix = os.path.commonprefix(dataset)
return "{}*{}".format(commonprefix,os.path.commonprefix([s[len(commonprefix):][::-1] for s in dataset])[::-1])
now let's test this:
for dataset in [['id.4030.paid','id.1280.paid','id.88.paid'],['aBBc', 'aBc'],[]]:
print(compute_generic_string(dataset))
result:
id.*.paid
aB*c
*
(when dataset is empty, code returns *
, maybe that should be another edge case)
from os.path import commonprefix
def commonsuffix(m):
return commonprefix([s[::-1] for s in m])[::-1]
def inverse_glob(strs):
start = commonprefix(strs)
n = len(start)
ends = [s[n:] for s in strs]
end = commonsuffix(ends)
if start and not any(ends):
return start
else:
return start + '*' + end
This problem is trickier than it seems at face value.
As currently specified, the question remains poorly constrained, i.e. there is not a unique solution. For input ['spamAndEggs', 'spamAndHamAndEggs']
, both spam*AndEggs
and spamAnd*Eggs
are valid answers. For input ['aXXXXz', 'aXXXz']
there are four possible solutions. In the code given above, we prefer choosing the longest possible prefix in order to make the solution unique.
Credit to JFF's answer for pointing out the existence of os.path.commonprefix
.
Inverse glob - reverse engineer a wildcard string from file names is a related and more difficult generalisation of this question.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With