I want to split a string into tokens.
I ripped off another Stack Overflow question, "Equivalent to StringTokenizer with multiple characters delimiters", but I want to know if this can be done with only String methods (.equals(), .startsWith(), etc.). I don't want to use regexes, the StringTokenizer class, Patterns, Matchers or anything other than String, for that matter.
For example, this is how I want to call the method:
String[] delimiters = {" ", "==", "=", "+", "+=", "++", "-", "-=", "--", "/", "/=", "*", "*=", "(", ")", ";", "/**", "*/", "\t", "\n"};
String splitString[] = tokenizer(contents, delimiters);
And this is the code I ripped off from the other question (I don't want to do it this way):
private String[] tokenizer(String string, String[] delimiters) {
    // First, create a regular expression that matches the union of the
    // delimiters.
    // Be aware that, in case of delimiters containing others (example &&
    // and &), the longer must come before the shorter (&& should be before
    // &) or the regexp parser will recognize && as two &.
    Arrays.sort(delimiters, new Comparator<String>() {
        @Override
        public int compare(String o1, String o2) {
            return -o1.compareTo(o2);
        }
    });
    // Build a string that will contain the regular expression
    StringBuilder regexpr = new StringBuilder();
    regexpr.append('(');
    for (String delim : delimiters) { // For each delimiter
        if (regexpr.length() != 1)
            regexpr.append('|'); // Add union separator if needed
        for (int i = 0; i < delim.length(); i++) {
            // Escape each character so that regexp reserved characters
            // are matched literally
            regexpr.append('\\');
            regexpr.append(delim.charAt(i));
        }
    }
    regexpr.append(')'); // Close the union
    Pattern p = Pattern.compile(regexpr.toString());
    // Now, search for the tokens
    List<String> res = new ArrayList<String>();
    Matcher m = p.matcher(string);
    int pos = 0;
    while (m.find()) { // While there's a delimiter in the string
        if (pos != m.start()) {
            // If there's something between the current and the previous
            // delimiter, add it to the tokens list
            res.add(string.substring(pos, m.start()));
        }
        res.add(m.group()); // Add the delimiter
        pos = m.end(); // Remember end of delimiter
    }
    if (pos != string.length()) {
        // If some characters remain in the string after the last
        // delimiter, add them to the token list
        res.add(string.substring(pos));
    }
    // Return the result
    return res.toArray(new String[res.size()]);
}

public static String[] clean(final String[] v) {
    List<String> list = new ArrayList<String>(Arrays.asList(v));
    list.removeAll(Collections.singleton(" "));
    return list.toArray(new String[list.size()]);
}
Edit: I ONLY want to use string methods charAt, equals, equalsIgnoreCase, indexOf, length, and substring
EDIT: My original answer did not quite do the trick: it did not include the delimiters in the resultant array, and it used the String.split() method, which was not allowed.
Here's my new solution, which is split into 2 methods:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/**
 * Splits the string at all specified literal delimiters, and includes the delimiters in the resulting array
 */
private static String[] tokenizer(String subject, String[] delimiters) {
    // Sort delimiters into length order, starting with longest
    // (note: this reorders the caller's array in place)
    Arrays.sort(delimiters, new Comparator<String>() {
        @Override
        public int compare(String s1, String s2) {
            return s2.length() - s1.length();
        }
    });
    // start with a list with only one string - the whole thing
    List<String> tokens = new ArrayList<String>();
    tokens.add(subject);
    // loop through the delimiters, splitting on each one
    for (int i = 0; i < delimiters.length; i++) {
        tokens = splitStrings(tokens, delimiters, i);
    }
    return tokens.toArray(new String[] {});
}
/**
 * Splits each String in the subject at the delimiter
 */
private static List<String> splitStrings(List<String> subject, String[] delimiters, int delimiterIndex) {
    List<String> result = new ArrayList<String>();
    String delimiter = delimiters[delimiterIndex];
    // for each input string
    for (String part : subject) {
        int start = 0;
        // if this part equals one of the delimiters, don't split it up any more
        boolean alreadySplit = false;
        for (String testDelimiter : delimiters) {
            if (testDelimiter.equals(part)) {
                alreadySplit = true;
                break;
            }
        }
        if (!alreadySplit) {
            for (int index = 0; index < part.length(); index++) {
                String subPart = part.substring(index);
                // ignore empty delimiters, they would match at every position
                if (delimiter.length() > 0 && subPart.indexOf(delimiter) == 0) {
                    if (index > start) {
                        result.add(part.substring(start, index)); // non-empty part before delimiter
                    }
                    result.add(delimiter); // delimiter
                    start = index + delimiter.length(); // next part starts after delimiter
                    // skip past the delimiter, so overlapping matches
                    // (e.g. "==" in "===") cannot leave start > index
                    index = start - 1;
                }
            }
        }
        if (start < part.length()) {
            result.add(part.substring(start)); // rest of string after last delimiter
        }
    }
    return result;
}
Original Answer
I notice you are using Pattern when you said you only wanted to use String methods.
The approach I would take would be to think of the simplest way possible. I think that is to first replace all the possible delimiters with just one delimiter, and then do the split.
Here's the code:
private String[] tokenizer(String string, String[] delimiters) {
    // replace all specified delimiters with one
    for (String delimiter : delimiters) {
        // String.replace() replaces every occurrence of the literal text
        string = string.replace(delimiter, "{split}");
    }
    // now split at the new delimiter
    return string.split("\\{split\\}");
}
I use String.replace() and not String.replaceAll() because replace() takes literal text and replaceAll() takes a regex argument, and the delimiters supplied are literal text.
replace() already replaces every occurrence of the delimiter, so no extra loop is needed to catch repeated instances.
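To make the difference between the two methods concrete, here is a minimal sketch. The class name LiteralReplaceDemo, its helper methods and the sample input are made up for illustration. replace() treats "+" as plain text and replaces every occurrence in a single call, while replaceAll() tries to compile "+" as a regular expression and fails.

```java
import java.util.regex.PatternSyntaxException;

class LiteralReplaceDemo {

    // Replaces every literal occurrence of delimiter; no regex is involved.
    static String replaceLiteral(String input, String delimiter, String marker) {
        return input.replace(delimiter, marker);
    }

    // Returns true if replaceAll() rejects the delimiter as an invalid regex.
    static boolean failsAsRegex(String input, String delimiter, String marker) {
        try {
            input.replaceAll(delimiter, marker);
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(replaceLiteral("a+b+c", "+", "|")); // a|b|c
        System.out.println(failsAsRegex("a+b+c", "+", "|"));   // true
    }
}
```

A delimiter such as ";" happens to be a valid regex, so replaceAll() would appear to work for it; that is exactly why the literal version is the safer choice for an arbitrary delimiter list.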
Using only non-regex String methods... I used the startsWith(...) method, which wasn't in the list of methods you allowed, because it does a simple string comparison rather than a regex comparison.
The following implementation:
public static void main(String... params) {
    String haystack = "abcdefghijklmnopqrstuvwxyz";
    String[] needles = new String[] { "def", "tuv" };
    String[] tokens = splitIntoTokensUsingNeedlesFoundInHaystack(haystack, needles);
    for (String string : tokens) {
        System.out.println(string);
    }
}

private static String[] splitIntoTokensUsingNeedlesFoundInHaystack(String haystack, String[] needles) {
    List<String> list = new LinkedList<String>();
    StringBuilder builder = new StringBuilder();
    for (int haystackIndex = 0; haystackIndex < haystack.length(); haystackIndex++) {
        boolean foundAnyNeedle = false;
        String substring = haystack.substring(haystackIndex);
        for (int needleIndex = 0; (!foundAnyNeedle) && needleIndex < needles.length; needleIndex++) {
            String needle = needles[needleIndex];
            if (substring.startsWith(needle)) {
                if (builder.length() > 0) {
                    list.add(builder.toString());
                    builder = new StringBuilder();
                }
                foundAnyNeedle = true;
                list.add(needle);
                haystackIndex += (needle.length() - 1);
            }
        }
        if (!foundAnyNeedle) {
            builder.append(substring.charAt(0));
        }
    }
    if (builder.length() > 0) {
        list.add(builder.toString());
    }
    return list.toArray(new String[] {});
}
outputs
abc
def
ghijklmnopqrs
tuv
wxyz
Note... This code is demo-only. In the event that one of the delimiters is an empty String, it will behave poorly and eventually crash with OutOfMemoryError: Java heap space after consuming a lot of CPU.
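One possible way to avoid that failure mode is sketched here under a few assumptions. Empty delimiters are simply skipped, and at each position the longest matching delimiter wins, so the order of the delimiter array does not matter ("+=" is not split into "+" and "="). The names LongestMatchTokenizer, tokenize and matchesAt are hypothetical. Apart from a list to collect the results, the sketch uses only charAt, length and substring, in line with the question's edit.

```java
import java.util.ArrayList;
import java.util.List;

class LongestMatchTokenizer {

    // True if delimiter occurs in text starting exactly at position pos.
    static boolean matchesAt(String text, int pos, String delimiter) {
        if (pos + delimiter.length() > text.length()) {
            return false;
        }
        for (int i = 0; i < delimiter.length(); i++) {
            if (text.charAt(pos + i) != delimiter.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    static String[] tokenize(String text, String[] delimiters) {
        List<String> tokens = new ArrayList<String>();
        int tokenStart = 0; // start of the pending non-delimiter text
        int pos = 0;
        while (pos < text.length()) {
            // Find the longest non-empty delimiter that starts at pos.
            String best = null;
            for (String delimiter : delimiters) {
                if (delimiter.length() > 0
                        && (best == null || delimiter.length() > best.length())
                        && matchesAt(text, pos, delimiter)) {
                    best = delimiter;
                }
            }
            if (best == null) {
                pos++; // ordinary character, keep scanning
            } else {
                if (pos > tokenStart) {
                    tokens.add(text.substring(tokenStart, pos)); // text before the delimiter
                }
                tokens.add(best); // the delimiter itself
                pos += best.length();
                tokenStart = pos;
            }
        }
        if (tokenStart < text.length()) {
            tokens.add(text.substring(tokenStart)); // trailing text
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] delimiters = {" ", "=", "+", "+=", ""};
        // Prints [a], [ ], [+=], [ ], [b] on separate lines.
        for (String token : tokenize("a += b", delimiters)) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Because pos always advances by at least one character per iteration, the loop terminates even when the delimiter array contains an empty string.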