Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Split a string with multiple delimiters using only String methods

Tags:

java

tokenize

I want to split a string into tokens.

I ripped of another Stack Overflow question - Equivalent to StringTokenizer with multiple characters delimiters, but I want to know if this can be done with only string methods (.equals(), .startsWith(), etc.). I don't want to use RegEx's, the StringTokenizer class, Patterns, Matchers or anything other than String for that matter.

For example, this is how I want to call the method

String[] delimiters = {" ", "==", "=", "+", "+=", "++", "-", "-=", "--", "/", "/=", "*", "*=", "(", ")", ";", "/**", "*/", "\t", "\n"};
        String splitString[] = tokenizer(contents, delimiters);

And this is the code I ripped of the other question (I don't want to do this).

    private String[] tokenizer(String string, String[] delimiters) {
        // First, create a regular expression that matches the union of the
        // delimiters
        // Be aware that, in case of delimiters containing others (example &&
        // and &),
        // the longer may be before the shorter (&& should be before &) or the
        // regexpr
        // parser will recognize && as two &.
        Arrays.sort(delimiters, new Comparator<String>() {
            @Override
            public int compare(String o1, String o2) {
                return -o1.compareTo(o2);
            }
        });
        // Build a string that will contain the regular expression
        StringBuilder regexpr = new StringBuilder();
        regexpr.append('(');
        for (String delim : delimiters) { // For each delimiter
            if (regexpr.length() != 1)
                regexpr.append('|'); // Add union separator if needed
            for (int i = 0; i < delim.length(); i++) {
                // Add an escape character if the character is a regexp reserved
                // char
                regexpr.append('\\');
                regexpr.append(delim.charAt(i));
            }
        }
        regexpr.append(')'); // Close the union
        Pattern p = Pattern.compile(regexpr.toString());

        // Now, search for the tokens
        List<String> res = new ArrayList<String>();
        Matcher m = p.matcher(string);
        int pos = 0;
        while (m.find()) { // While there's a delimiter in the string
            if (pos != m.start()) {
                // If there's something between the current and the previous
                // delimiter
                // Add it to the tokens list
                res.add(string.substring(pos, m.start()));
            }
            res.add(m.group()); // add the delimiter
            pos = m.end(); // Remember end of delimiter
        }
        if (pos != string.length()) {
            // If it remains some characters in the string after last delimiter
            // Add this to the token list
            res.add(string.substring(pos));
        }
        // Return the result
        return res.toArray(new String[res.size()]);
    }
    public static String[] clean(final String[] v) {
        List<String> list = new ArrayList<String>(Arrays.asList(v));
        list.removeAll(Collections.singleton(" "));
        return list.toArray(new String[list.size()]);
    }

Edit: I ONLY want to use string methods charAt, equals, equalsIgnoreCase, indexOf, length, and substring

like image 360
Aditya Ramkumar Avatar asked Oct 31 '15 17:10

Aditya Ramkumar


People also ask

How do I split a string with multiple delimiters in SQL?

Using the STUFF & FOR XML PATH function we can derive the input string lists into an XML format based on delimiter lists. And finally we can load the data into the temp table..

Can split () take multiple arguments?

split() method accepts two arguments. The first optional argument is separator , which specifies what kind of separator to use for splitting the string. If this argument is not provided, the default value is any whitespace, meaning the string will split whenever .

How do I split a string into multiple parts?

You can split a string by each character using an empty string('') as the splitter. In the example below, we split the same message using an empty string. The result of the split will be an array containing all the characters in the message string.

How do I use StringTokenizer with multiple delimiters?

In order to break String into tokens, you need to create a StringTokenizer object and provide a delimiter for splitting strings into tokens. You can pass multiple delimiters e.g. you can break String into tokens by, and: at the same time. If you don't provide any delimiter then by default it will use white-space.


2 Answers

EDIT: My original answer did not quite do the trick, it did not include the delimiters in the resultant array, and used the String.split() method, which was not allowed.

Here's my new solution, which is split into 2 methods:

/**
 * Splits the string at all specified literal delimiters, and includes the delimiters in the resulting array
 */
private static String[] tokenizer(String subject, String[] delimiters)  { 

    //Sort delimiters into length order, starting with longest
    Arrays.sort(delimiters, new Comparator<String>() {
        @Override
        public int compare(String s1, String s2) {
          return s2.length()-s1.length();
         }
      });

    //start with a list with only one string - the whole thing
    List<String> tokens = new ArrayList<String>();
    tokens.add(subject);

    //loop through the delimiters, splitting on each one
    for (int i=0; i<delimiters.length; i++) {
        tokens = splitStrings(tokens, delimiters, i);
    }

    return tokens.toArray(new String[] {});
}

/**
 * Splits each String in the subject at the delimiter
 */
private static List<String> splitStrings(List<String> subject, String[] delimiters, int delimiterIndex) {

    List<String> result = new ArrayList<String>();
    String delimiter = delimiters[delimiterIndex];

    //for each input string
    for (String part : subject) {

        int start = 0;

        //if this part equals one of the delimiters, don't split it up any more
        boolean alreadySplit = false;
        for (String testDelimiter : delimiters) {
            if (testDelimiter.equals(part)) {
                alreadySplit = true;
                break;
            }
        }

        if (!alreadySplit) {
            for (int index=0; index<part.length(); index++) {
                String subPart = part.substring(index);
                if (subPart.indexOf(delimiter)==0) {
                    result.add(part.substring(start, index));   // part before delimiter
                    result.add(delimiter);                      // delimiter
                    start = index+delimiter.length();           // next parts starts after delimiter
                }
            }
        }
        result.add(part.substring(start));                      // rest of string after last delimiter          
    }
    return result;
}

Original Answer

I notice you are using Pattern when you said you only wanted to use String methods.

The approach I would take would be to think of the simplest way possible. I think that is to first replace all the possible delimiters with just one delimiter, and then do the split.

Here's the code:

private String[] tokenizer(String string, String[] delimiters)  {       

    //replace all specified delimiters with one
    for (String delimiter : delimiters) {
        while (string.indexOf(delimiter)!=-1) {
            string = string.replace(delimiter, "{split}");
        }
    }

    //now split at the new delimiter
    return string.split("\\{split\\}");

}

I need to use String.replace() and not String.replaceAll() because replace() takes literal text and replaceAll() takes a regex argument, and the delimiters supplied are of literal text.

That's why I also need a while loop to replace all instances of each delimiter.

like image 116
NickJ Avatar answered Oct 21 '22 04:10

NickJ


Using only non-regex String methods... I used the startsWith(...) method, which wasn't in the exclusive list of methods that you listed because it does simply string comparison rather than a regex comparison.

The following impl:

public static void main(String ... params) {
    String haystack = "abcdefghijklmnopqrstuvwxyz";
    String [] needles = new String [] { "def", "tuv" };
    String [] tokens = splitIntoTokensUsingNeedlesFoundInHaystack(haystack, needles);
    for (String string : tokens) {
        System.out.println(string);
    }
}

private static String[] splitIntoTokensUsingNeedlesFoundInHaystack(String haystack, String[] needles) {
    List<String> list = new LinkedList<String>();
    StringBuilder builder = new StringBuilder();
    for(int haystackIndex = 0; haystackIndex < haystack.length(); haystackIndex++) {
        boolean foundAnyNeedle = false;
        String substring = haystack.substring(haystackIndex);
        for(int needleIndex = 0; (!foundAnyNeedle) && needleIndex < needles.length; needleIndex ++) {
            String needle = needles[needleIndex];
            if(substring.startsWith(needle)) {
                if(builder.length() > 0) {
                    list.add(builder.toString());
                    builder = new StringBuilder();
                }
                foundAnyNeedle = true;
                list.add(needle);
                haystackIndex += (needle.length() - 1);
            }
        }
        if( ! foundAnyNeedle) {
            builder.append(substring.charAt(0));
        }
    }
    if(builder.length() > 0) {
        list.add(builder.toString());
    }
    return list.toArray(new String[]{});
}

outputs

abc
def
ghijklmnopqrs
tuv
wxyz

Note... This code is demo-only. In the event that one of the delimiters is any empty String, it will behave poorly and eventually crash with OutOfMemoryError: Java heap space after consuming a lot of CPU.

like image 39
Nathan Avatar answered Oct 21 '22 02:10

Nathan