Find substring ignoring specified characters

Does anyone know of an easy, clean way to find a substring within a string while ignoring certain specified characters? I think an example would explain things better:

  • string: "Hello, -this- is a string"
  • substring to find: "Hello this"
  • chars to ignore: "," and "-"
  • found the substring, result: "Hello, -this"

Using Regex is not a requirement for me, but I added the tag because it feels related.

Update:

To make the requirement clearer: I need the resulting substring with the ignored chars, not just an indication that the given substring exists.

Update 2: Some of you are reading too much into the example, sorry. I'll give another scenario that should work:

  • string: "?A&3/3/C)412&"
  • substring to find: "A41"
  • chars to ignore: "&", "/", "3", "C", ")"
  • found the substring, result: "A&3/3/C)41"

And as a bonus (not strictly required), it would be great if the solution did not assume that the substring to find is free of the ignored chars, e.g., given the last example we should be able to do:

  • substring to find: "A3C412&"
  • chars to ignore: "&", "/", "3", "C", ")"
  • found the substring, result: "A&3/3/C)412&"

Sorry if I wasn't clear before, or if I'm still not :).

Update 3:

Thanks to everyone who helped! This is the implementation I'm working with for now:

  • http://www.pastebin.com/pYHbb43Z

And here are some tests:

  • http://www.pastebin.com/qh01GSx2

I'm using some custom extension methods that I'm not including, but I believe they should be self-explanatory (I will add them if you like). I've taken a lot of your ideas for the implementation and the tests, but I'm giving the answer to @PierrOz because he was one of the first and pointed me in the right direction. Feel free to keep giving suggestions, either as alternative solutions or as comments on the current state of the implementation.

Fredy Treboux asked Apr 07 '10 13:04


3 Answers

In your example, you would do:

string input = "Hello, -this-, is a string";
string ignore = "[-,]*";
// No ignore class after the final 's', so the match does not swallow trailing ignored chars
Regex r = new Regex(string.Format("H{0}e{0}l{0}l{0}o{0} {0}t{0}h{0}i{0}s", ignore));
Match m = r.Match(input);
return m.Success ? m.Value : string.Empty;

To do this dynamically, build the character class [-,]* from all the characters to ignore and insert it between each pair of characters in your query.

Take care with '-' inside the class []: put it at the beginning or at the end so that it is not read as a range.
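To make the dash caveat concrete, here is a minimal sketch of how its position changes what a character class matches. The names DashPlacementDemo and InClass are made up for this illustration; only Regex.IsMatch comes from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DashPlacementDemo
{
    // Hypothetical helper: true when the whole string s is one character
    // matched by the given character class.
    public static bool InClass(string s, string charClass)
    {
        return Regex.IsMatch(s, "^" + charClass + "$");
    }

    public static void Main()
    {
        // Dash between '+' (43) and '/' (47) forms a range, which also covers ',' (44).
        Console.WriteLine(InClass(",", "[+-/]"));   // True
        // Dash first: it is a literal, so the class holds only '-', '+' and '/'.
        Console.WriteLine(InClass(",", "[-+/]"));   // False
        Console.WriteLine(InClass("-", "[-+/]"));   // True
    }
}
```

With the dash in the middle, the class silently ignores characters you never listed, which is why the generic version should move it to the start or the end.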

So more generically, it would give something like:

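// Requires: using System.Text.RegularExpressions;
public string Test(string query, string input, char[] ignoreList)
{
    // Build a character class such as [-,]* from the characters to ignore.
    string ignorePattern = "[";
    for (int i = 0; i < ignoreList.Length; i++)
    {
        if (ignoreList[i] == '-')
        {
            // Strings are immutable, so keep the result of Insert.
            // A dash right after '[' is a literal, not a range.
            ignorePattern = ignorePattern.Insert(1, "-");
        }
        else if (ignoreList[i] == ']' || ignoreList[i] == '\\' || ignoreList[i] == '^')
        {
            // These characters are special inside a class and must be escaped.
            ignorePattern += "\\" + ignoreList[i];
        }
        else
        {
            ignorePattern += ignoreList[i];
        }
    }

    // An empty class "[]" is not valid, so use no ignore pattern at all in that case.
    ignorePattern = ignoreList.Length == 0 ? "" : ignorePattern + "]*";

    // Put the ignore class between the query characters, not after the last one,
    // so the match does not swallow trailing ignored characters.
    string pattern = "";
    for (int i = 0; i < query.Length; i++)
    {
        if (i > 0)
        {
            pattern += ignorePattern;
        }
        pattern += Regex.Escape(query[i].ToString());
    }

    Regex r = new Regex(pattern);
    Match m = r.Match(input);
    return m.Success ? m.Value : string.Empty;
}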
pierroz answered Nov 15 '22 06:11


Here's a non-regex string extension option (it needs using System.Linq for the Contains call on the char array):

public static class StringExtensions
{
    public static bool SubstringSearch(this string s, string value, char[] ignoreChars, out string result)
    {
        if (String.IsNullOrEmpty(value))
            throw new ArgumentException("Search value cannot be null or empty.", "value");

        bool found = false;
        int matches = 0;
        int startIndex = -1;
        int length = 0;

        for (int i = 0; i < s.Length && !found; i++)
        {
            if (startIndex == -1)
            {
                if (s[i] == value[0])
                {
                    startIndex = i;
                    ++matches;
                    ++length;
                }
            }
            else
            {
                if (s[i] == value[matches])
                {
                    ++matches;
                    ++length;
                }
                else if (ignoreChars != null && ignoreChars.Contains(s[i]))
                {
                    ++length;
                }
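                else
                {
                    // Mismatch: restart the search one character after the failed start,
                    // otherwise a match beginning inside the failed attempt is missed
                    // (e.g. finding "AB" in "AAB").
                    i = startIndex;
                    startIndex = -1;
                    matches = 0;
                    length = 0;
                }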
            }

            found = (matches == value.Length);
        }

        if (found)
        {
            result = s.Substring(startIndex, length);
        }
        else
        {
            result = null;
        }
        return found;
    }
}
300 baud answered Nov 15 '22 06:11


EDIT: here's an updated solution addressing the points in your recent update. The idea is the same, except that when the search text is a single word the ignore pattern is inserted between each of its characters. If the search text contains spaces, it is split on the spaces and the ignore pattern is inserted between the words instead. If you don't need the latter behavior (which was more in line with your original question), you can remove the Split call and the if check that selects between the two patterns.

Note that this approach is not going to be the most efficient.

string input = @"foo ?A&3/3/C)412& bar A341C2";
string substring = "A41";
string[] ignoredChars = { "&", "/", "3", "C", ")" };

// builds up the ignored pattern and ensures a dash char is placed at the end to avoid unintended ranges
string ignoredPattern = String.Concat("[",
                            String.Join("", ignoredChars.Where(c => c != "-")
                                                        .Select(c => Regex.Escape(c)).ToArray()),
                            (ignoredChars.Contains("-") ? "-" : ""),
                            "]*?");

string[] substrings = substring.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

// Regex.Escape keeps metacharacters in the search text (such as '.' or '?') literal
string pattern = "";
if (substrings.Length > 1)
{
    pattern = String.Join(ignoredPattern, substrings.Select(w => Regex.Escape(w)).ToArray());
}
else
{
    pattern = String.Join(ignoredPattern, substring.Select(c => Regex.Escape(c.ToString())).ToArray());
}

foreach (Match match in Regex.Matches(input, pattern))
{
    Console.WriteLine("Index: {0} -- Match: {1}", match.Index, match.Value);
}


Try this solution out:
string input = "Hello, -this- is a string";
string[] searchStrings = { "Hello", "this" };
string pattern = String.Join(@"\W+", searchStrings);

foreach (Match match in Regex.Matches(input, pattern))
{
    Console.WriteLine(match.Value);
}

\W+ matches one or more non-word characters, i.e. anything other than letters, digits, and the underscore. If you would rather specify them yourself, replace it with a character class of the characters to ignore, such as [ ,.-]+ (always place the dash at the start or end of the class to avoid unintended range specifications). Also, if you need case to be ignored, use RegexOptions.IgnoreCase:

Regex.Matches(input, pattern, RegexOptions.IgnoreCase)
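As a small illustration of that option, the sketch below runs the same \W+ pattern with and without RegexOptions.IgnoreCase. The names IgnoreCaseDemo and FirstMatch are made up for this example, and the search words are deliberately given in the wrong case.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IgnoreCaseDemo
{
    // Hypothetical helper: joins the words with \W+ and returns the first match,
    // or null when nothing matches.
    public static string FirstMatch(string input, string[] words, RegexOptions options)
    {
        string pattern = String.Join(@"\W+", words);
        Match m = Regex.Match(input, pattern, options);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        string input = "Hello, -this- is a string";
        string[] words = { "hello", "THIS" };

        // Case-sensitive by default: "hello" does not match "Hello".
        Console.WriteLine(FirstMatch(input, words, RegexOptions.None) ?? "(no match)");
        // With IgnoreCase the original casing of the input is returned.
        Console.WriteLine(FirstMatch(input, words, RegexOptions.IgnoreCase)); // Hello, -this
    }
}
```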

If your substring is in the form of a complete string, such as "Hello this", you can easily get it into the array form used for searchStrings in this way:

string[] searchStrings = substring.Split(new[] { ' ' },
                            StringSplitOptions.RemoveEmptyEntries);
Ahmad Mageed answered Nov 15 '22 04:11