
split a string in java into equal length substrings while maintaining word boundaries

Tags: java, string

How to split a string into equal parts of maximum character length while maintaining word boundaries?

For example, splitting the string "hello world" into substrings of at most 7 characters should return

"hello "

and

"world"

But my current implementation returns

"hello w"

and

"orld   "

I am using the following code, taken from "Split string to equal length substrings in Java", to split the input string into equal parts:

public static List<String> splitEqually(String text, int size) {
    // Give the list the right capacity to start with. You could use an array
    // instead if you wanted.
    List<String> ret = new ArrayList<String>((text.length() + size - 1) / size);

    for (int start = 0; start < text.length(); start += size) {
        ret.add(text.substring(start, Math.min(text.length(), start + size)));
    }
    return ret;
}

Is it possible to maintain word boundaries while splitting the string into substrings?

To be more specific, the splitting algorithm should break only at the word boundaries provided by spaces. The character length still matters, but as a maximum number of characters per substring rather than a fixed length.

asked Sep 15 '14 17:09 by Nav




2 Answers

If I understand your problem correctly, this code should do what you need (but it assumes that maxLenght is equal to or greater than the length of the longest word):

String data = "Hello there, my name is not importnant right now."
        + " I am just simple sentecne used to test few things.";
int maxLenght = 10;
Pattern p = Pattern.compile("\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)", Pattern.DOTALL);
Matcher m = p.matcher(data);
while (m.find())
    System.out.println(m.group(1));

Output:

Hello
there, my
name is
not
importnant
right now.
I am just
simple
sentecne
used to
test few
things.

Short (or not) explanation of "\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)" regex:

(Remember that in Java \ is special not only in regex but also in String literals, so to use a predefined character class like \d we need to write it as "\\d", because the \ must also be escaped in the string literal.)

  • \G - an anchor that matches at the end of the previous match or, if there is no match yet (the search has just started), at the beginning of the string (like ^)
  • \s* - zero or more whitespace characters (\s represents whitespace, * is the "zero-or-more" quantifier)
  • (.{1,"+maxLenght+"}) - let's split it into parts (at runtime maxLenght holds a numeric value such as 10, so the regex engine sees .{1,10})
    • . represents any character (by default any character except line separators like \n or \r, but thanks to the Pattern.DOTALL flag it matches those too; you can drop this flag if you want each line of the input to be split separately, so that no chunk spans a line break)
    • {1,10} - a quantifier that lets the preceding element appear 1 to 10 times (it is greedy by default, so it tries to match as many repetitions as possible)
    • .{1,10} - so, based on the above, it simply means "1 to 10 of any characters"
    • ( ) - parentheses create groups, which let us capture specific parts of the match (here the group starts after \\s* because we only want the part after the whitespace)
  • (?=\\s|$) - a look-ahead that makes sure the text matched by .{1,10} is followed by:

    • whitespace (\\s)

      OR (written as |)

    • the end of the string ($).

So .{1,10} matches up to 10 characters, and the (?=\\s|$) after it requires that the last matched character is not in the middle of a word (there must be whitespace or the end of the string after it).
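To make the snippet above reusable, the same regex can be wrapped in a method that returns the chunks as a list instead of printing them. This is only a minimal sketch: the class name WordBoundarySplitter and the method name split are hypothetical, and it keeps the answer's assumption that no word is longer than the maximum length.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundarySplitter {
    // Hypothetical helper wrapping the regex from the answer above.
    // Assumes maxLength >= 1 and that no word is longer than maxLength;
    // matching stops at the first word that does not fit.
    static List<String> split(String text, int maxLength) {
        Pattern p = Pattern.compile(
                "\\G\\s*(.{1," + maxLength + "})(?=\\s|$)", Pattern.DOTALL);
        Matcher m = p.matcher(text);
        List<String> chunks = new ArrayList<>();
        while (m.find()) {
            // Group 1 is the chunk without the leading whitespace.
            chunks.add(m.group(1));
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(split("hello world", 7)); // prints [hello, world]
    }
}
```

Note that the separating space is dropped, so the example from the question yields "hello" and "world" rather than "hello " and "world".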

answered Oct 20 '22 10:10 by Pshemo


Non-regex solution, just in case someone is more comfortable (?) not using regular expressions:

private String justify(String s, int limit) {
    StringBuilder justifiedText = new StringBuilder();
    StringBuilder justifiedLine = new StringBuilder();
    String[] words = s.split(" ");
    for (int i = 0; i < words.length; i++) {
        // Add the word plus a separating space to the current line.
        justifiedLine.append(words[i]).append(" ");
        // Close the line after the last word, or when the next word would exceed the limit.
        if (i+1 == words.length || justifiedLine.length() + words[i+1].length() > limit) {
            // Drop the trailing space before emitting the line.
            justifiedLine.deleteCharAt(justifiedLine.length() - 1);
            justifiedText.append(justifiedLine.toString()).append(System.lineSeparator());
            justifiedLine = new StringBuilder();
        }
    }
    return justifiedText.toString();
}

Test:

String text = "Long sentence with spaces, and punctuation too. And supercalifragilisticexpialidocious words. No carriage returns, tho -- since it would seem weird to count the words in a new line as part of the previous paragraph's length.";
System.out.println(justify(text, 15));

Output:

Long sentence
with spaces,
and punctuation
too. And
supercalifragilisticexpialidocious
words. No
carriage
returns, tho --
since it would
seem weird to
count the words
in a new line
as part of the
previous
paragraph's
length.

It handles words that are longer than the set limit by putting each one on its own line instead of skipping it (unlike the regex version, which simply stops processing when it reaches supercalifragilisticexpialidocious).
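If the regex approach is preferred but overlong words should not stop processing, one possible variation is to add a fallback alternative that takes a whole run of non-whitespace characters when no chunk within the limit fits. This is a sketch that appears in neither answer: the class name OverlongWordSplitter and the \S+ fallback are assumptions made here, and it assumes words are separated by single spaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlongWordSplitter {
    // Hypothetical variation of the regex from the first answer.
    // First alternative: up to maxLength characters ending at a word boundary.
    // Second alternative (\S+): a whole word that is longer than maxLength.
    static List<String> split(String text, int maxLength) {
        Pattern p = Pattern.compile(
                "\\G\\s*(.{1," + maxLength + "}(?=\\s|$)|\\S+)", Pattern.DOTALL);
        // Trim first so trailing whitespace cannot be captured as a chunk of its own.
        Matcher m = p.matcher(text.trim());
        List<String> chunks = new ArrayList<>();
        while (m.find()) {
            chunks.add(m.group(1));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "too. And supercalifragilisticexpialidocious words. No";
        // prints [too. And, supercalifragilisticexpialidocious, words. No]
        System.out.println(split(text, 15));
    }
}
```

As in the non-regex version, an overlong word ends up in a chunk of its own that exceeds the limit; whether to break such words instead is a separate design decision.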

PS: The comment about all input words being expected to be shorter than the set limit was made after I came up with this solution ;)

answered Oct 20 '22 09:10 by walen