Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to combine items in List<string> to make new items efficiently

Tags:

c#

.net

I have a case where I have the name of an object, and a bunch of file names. I need to match the correct file name with the object. The file name can contain numbers and words, separated by either hyphen(-) or underscore(_). I have no control of either file name or object name. For example:

10-11-12_001_002_003_13001_13002_this_is_an_example.svg

The object name in this case is just a string, representing an number

10001

I need to return true or false if the file name is a match for the object name. The different segments of the file name can match on their own, or any combination of two segments. In the example above, it should be true for the following cases (not every true case, just examples):

10001
10002
10003
11001
11002
11003
12001
12002
12003
13001
13002

And, we should return false for this case (among others):

13003

What I've come up with so far is this:

public bool IsMatch(string filename, string objectname)
{
    var namesegments = GetNameSegments(filename);
    var match = namesegments.Contains(objectname);
    return match;
}

public static List<string> GetNameSegments(string filename)
{
    var segments = filename.Split('_', '-').ToList();

    var newSegments = new List<string>();
    foreach (var segment in segments)
    {
        foreach (var segment2 in segments)
        {
            if (segment == segment2)
                continue;

            var newToken = segment + segment2;
            newSegments.Add(newToken);
        }
    }

    return segments.Concat(newSegments).ToList();
}

One or two segments combined can make a match, and that is enought. Three or more segments combined should not be considered.

This does work so far, but is there a better way to do it, perhaps without nesting foreach loops?

like image 951
spersson Avatar asked Dec 14 '17 11:12

spersson


2 Answers

First: don't change debugged, working, sufficiently efficient code for no reason. Your solution looks good.

However, we can make some improvements to your solution.

public static List<string> GetNameSegments(string filename)

Making the output a list puts restrictions on the implementation that are not required by the caller. It should be IEnumerable<String>. Particularly since the caller in this case only cares about the first match.

var segments = filename.Split('_', '-').ToList();

Why ToList? A list is array-backed. You've already got an array in hand. Just use the array.

Since there is no longer a need to build up a list, we can transform your two-loop solution into an iterator block:

public static IEnumerable<string> GetNameSegments(string filename)
{
  var segments = filename.Split('_', '-');
  foreach (var segment in segments)
    yield return segment;
  foreach (var s1 in segments)
    foreach (var s2 in segments)
      if (s1 != s2)
        yield return s1 + s2; 
}

Much nicer. Alternatively we could notice that this has the structure of a query and simply return the query:

public static IEnumerable<string> GetNameSegments(string filename)
{
  var q1= filename.Split('_', '-');
  var q2 = from s1 in q1
           from s2 in q1
           where s1 != s2
           select s1 + s2;
  return q1.Concat(q2);
}

Again, much nicer in this form.

Now let's talk about efficiency. As is often the case, we can achieve greater efficiency at a cost of increased complication. This code looks like it should be plenty fast enough. Your example has nine segments. Let's suppose that nine or ten is typical. Our solutions thus far consider the ten or so singletons first, and then the hundred or so combinations. That's nothing; this code is probably fine. But what if we had thousands of segments and were considering millions of possibilities?

In that case we should restructure the algorithm. One possibility would be this general solution:

public bool IsMatch(HashSet<string> segments, string name)
{
  if (segments.Contains(name))
    return true;
  var q = from s1 in segments
          where name.StartsWith(s1)
          let s2 = name.Substring(s1.Length)
          where s1 != s2
          where segments.Contains(s2)
          select 1; // Dummy. All we care about is if there is one.
  return q.Any();
}

Your original solution is quadratic in the number of segments. This one is linear; we rely on the constant order contains operation. (This assumes of course that string operations are constant time because strings are short. If that's not true then we have a whole other kettle of fish to fry.)

How else could we extract wins in the asymptotic case?

  • If we happened to have the property that the collection was not a hash set but rather a sorted list then we could do even better; we could binary search the list to find the start and end of the range of possible prefix matches, and then pour the list into a hashset to do the suffix matches. That's still linear, but could have a smaller constant factor.

  • If we happened to know that the target string was small compared to the number of segments, we could attack the problem from the other end. Generate all possible combinations of partitions of the target string and check if both halves are in the segment set. The problem with this solution is that it is quadratic in memory usage in the size of the string. So what we'd want to do there is construct a special hash on character sequences and use that to populate the hash table, rather than the standard string hash. I'm sure you can see how the solution would go from there; I shan't spell out the details.

like image 136
Eric Lippert Avatar answered Nov 07 '22 13:11

Eric Lippert


Efficiency is very much dependent on the business problem that you're attempting to solve. Without knowing the full context/usage it's difficult to define the most efficient solution. What works for one situation won't always work for others.

I would always advocate to write working code and then solve any performance issues later down the line (or throw more tin at the problem as it's usually cheaper!) If you're having specific performance issues then please do tell us more...

I'm going to go out on a limb here and say (hope) that you're only going to be matching the filename against the object name once per execution. If that's the case I reckon this approach will be just about the fastest. In a circumstance where you're matching a single filename against multiple object names then the obvious choice is to build up an index of sorts and match against that as you were already doing, although I'd consider different types of collection depending on your expected execution/usage.

public static bool IsMatch(string filename, string objectName)
{
    var segments = filename.Split('-', '_');

    for (int i = 0; i < segments.Length; i++)
    {
        if (string.Equals(segments[i], objectName)) return true;

        for (int ii = 0; ii < segments.Length; ii++)
        {
            if (ii == i) continue;
            if (string.Equals($"{segments[i]}{segments[ii]}", objectName)) return true;
        }
    }

    return false;
}
like image 28
James Law Avatar answered Nov 07 '22 11:11

James Law