Split String at natural language breaks

Tags: java, string, regex

Overview

I send Strings to a Text-to-Speech server that accepts a maximum length of 300 characters. Due to network latency, there may be a delay between each section of speech being returned, so I'd like to break the speech up at the most 'natural pauses' wherever possible.

Each server request costs me money, so ideally I'd send the longest string possible, up to the maximum allowed characters.

Here is my current implementation:

// Requires: import java.util.ArrayList; import java.util.concurrent.TimeUnit;

private static final boolean DEBUG = true;

private static final int MAX_UTTERANCE_LENGTH = 298;
private static final int MIN_UTTERANCE_LENGTH = 200;

private static final String FULL_STOP_SPACE = ". ";
private static final String QUESTION_MARK_SPACE = "? ";
private static final String EXCLAMATION_MARK_SPACE = "! ";
private static final String LINE_SEPARATOR = System.getProperty("line.separator");
private static final String COMMA_SPACE = ", ";
private static final String JUST_A_SPACE = " ";

public static ArrayList<String> splitUtteranceNaturalBreaks(String utterance) {

    final long then = System.nanoTime();

    final ArrayList<String> speakableUtterances = new ArrayList<String>();

    int splitLocation = 0;
    String success = null;

    while (utterance.length() > MAX_UTTERANCE_LENGTH) {

        splitLocation = utterance.lastIndexOf(FULL_STOP_SPACE, MAX_UTTERANCE_LENGTH);

        if (DEBUG) {
            System.out.println("(0 FULL STOP) - last index at: " + splitLocation);
        }

        if (splitLocation < MIN_UTTERANCE_LENGTH) {
            if (DEBUG) {
                System.out.println("(1 FULL STOP) - NOT_OK");
            }

            splitLocation = utterance.lastIndexOf(QUESTION_MARK_SPACE, MAX_UTTERANCE_LENGTH);

            if (DEBUG) {
                System.out.println("(1 QUESTION MARK) - last index at: " + splitLocation);
            }

            if (splitLocation < MIN_UTTERANCE_LENGTH) {
                if (DEBUG) {
                    System.out.println("(2 QUESTION MARK) - NOT_OK");
                }

                splitLocation = utterance.lastIndexOf(EXCLAMATION_MARK_SPACE, MAX_UTTERANCE_LENGTH);

                if (DEBUG) {
                    System.out.println("(2 EXCLAMATION MARK) - last index at: " + splitLocation);
                }

                if (splitLocation < MIN_UTTERANCE_LENGTH) {
                    if (DEBUG) {
                        System.out.println("(3 EXCLAMATION MARK) - NOT_OK");
                    }

                    splitLocation = utterance.lastIndexOf(LINE_SEPARATOR, MAX_UTTERANCE_LENGTH);

                    if (DEBUG) {
                        System.out.println("(3 SEPARATOR) - last index at: " + splitLocation);
                    }

                    if (splitLocation < MIN_UTTERANCE_LENGTH) {
                        if (DEBUG) {
                            System.out.println("(4 SEPARATOR) - NOT_OK");
                        }

                        splitLocation = utterance.lastIndexOf(COMMA_SPACE, MAX_UTTERANCE_LENGTH);

                        if (DEBUG) {
                            System.out.println("(4 COMMA) - last index at: " + splitLocation);
                        }

                        if (splitLocation < MIN_UTTERANCE_LENGTH) {
                            if (DEBUG) {
                                System.out.println("(5 COMMA) - NOT_OK");
                            }

                            splitLocation = utterance.lastIndexOf(JUST_A_SPACE, MAX_UTTERANCE_LENGTH);

                            if (DEBUG) {
                                System.out.println("(5 SPACE) - last index at: " + splitLocation);
                            }

                            if (splitLocation < MIN_UTTERANCE_LENGTH) {
                                if (DEBUG) {
                                    System.out.println("(6 SPACE) - NOT_OK");
                                }

                                splitLocation = MAX_UTTERANCE_LENGTH;

                                if (DEBUG) {
                                    System.out.println("(6 MAX_UTTERANCE_LENGTH) - last index at: " + splitLocation);
                                }

                            } else {
                                if (DEBUG) {
                                    System.out.println("Accepted");
                                }

                                splitLocation -= 1;
                            }
                        }
                    } else {
                        if (DEBUG) {
                            System.out.println("Accepted");
                        }

                        splitLocation -= 1;
                    }
                } else {
                    if (DEBUG) {
                        System.out.println("Accepted");
                    }
                }
            } else {
                if (DEBUG) {
                    System.out.println("Accepted");
                }
            }
        } else {
            if (DEBUG) {
                System.out.println("Accepted");
            }
        }

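        // Clamp the cut so the hard-cut fallback (splitLocation == MAX_UTTERANCE_LENGTH)
        // cannot run past the end of a 299-character utterance.
        final int cut = Math.min(splitLocation + 2, utterance.length());

        success = utterance.substring(0, cut);

        speakableUtterances.add(success.trim());

        if (DEBUG) {
            System.out.println("Split - Length: " + success.length() + " -:- " + success);
            System.out.println("------------------------------");
        }

        utterance = utterance.substring(cut).trim();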
    }

    speakableUtterances.add(utterance);

    if (DEBUG) {

        System.out.println("Split - Length: " + utterance.length() + " -:- " + utterance);

        final long now = System.nanoTime();
        final long elapsed = now - then;

        System.out.println("ELAPSED: " + TimeUnit.MILLISECONDS.convert(elapsed, TimeUnit.NANOSECONDS));

    }

    return speakableUtterances;
}

It's ugly because lastIndexOf doesn't accept a regex. Ugliness aside, it's actually pretty fast.
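
The nested fallbacks above can also be flattened into a loop over the delimiters in priority order. Below is a minimal sketch of that idea, not a drop-in replacement: the class name PrioritySplitter and its constants are made up for illustration, it assumes "\n" as the line break rather than System.getProperty("line.separator"), and it caps each chunk at the 300-character server limit with the delimiter included.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: tries each delimiter in order of preference.
final class PrioritySplitter {

    // Assumed priority order, mirroring the question's fallbacks.
    private static final String[] DELIMITERS = {". ", "? ", "! ", "\n", ", ", " "};
    private static final int MAX_LENGTH = 300; // server limit from the question
    private static final int MIN_LENGTH = 200; // shortest acceptable chunk

    static List<String> split(String utterance) {
        List<String> parts = new ArrayList<>();
        String rest = utterance.trim();
        while (rest.length() > MAX_LENGTH) {
            int cut = MAX_LENGTH; // hard cut when no delimiter is in range
            for (String delimiter : DELIMITERS) {
                // Last occurrence whose end still fits within MAX_LENGTH.
                int index = rest.lastIndexOf(delimiter, MAX_LENGTH - delimiter.length());
                if (index >= MIN_LENGTH) {
                    cut = index + delimiter.length();
                    break;
                }
            }
            parts.add(rest.substring(0, cut).trim());
            rest = rest.substring(cut).trim();
        }
        if (!rest.isEmpty()) {
            parts.add(rest);
        }
        return parts;
    }
}
```

Calling PrioritySplitter.split(text) returns the chunks in order. Adding or reordering a delimiter is then a one-line change to the array instead of another nesting level.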

Problems

Ideally I'd like to use regex that allows for a match on one of my first choice delimiters:

private static final String firstChoice = "[.!?" + LINE_SEPARATOR + "]\\s+";
private static final Pattern pFirstChoice = Pattern.compile(firstChoice);

And then use a matcher to resolve the position:

    Matcher matcher = pFirstChoice.matcher(input);

    if (matcher.find()) {
        splitLocation = matcher.start();
    }

An alternative to my current implementation would be to store the location of each delimiter match and then select the one nearest to MAX_UTTERANCE_LENGTH.
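
To make the "store each location and pick the one nearest the limit" idea concrete: Matcher.find() walks forward, so one option is to restrict the matcher to the first max characters with region() and remember the last match seen. This is only a rough sketch; LastBreakFinder, lastBreak and the exact pattern are placeholder choices of mine, and the \R line-break matcher needs Java 8 or later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds the last first-choice break inside a window.
final class LastBreakFinder {

    // Sentence punctuation followed by whitespace, or any line break.
    private static final Pattern FIRST_CHOICE = Pattern.compile("[.!?]\\s+|\\R");

    /**
     * Returns the end offset of the last first-choice break that ends
     * within [min, max], or -1 if there is none.
     */
    static int lastBreak(String text, int min, int max) {
        Matcher matcher = FIRST_CHOICE.matcher(text);
        // Matches are confined to the first max characters.
        matcher.region(0, Math.min(max, text.length()));
        int best = -1;
        while (matcher.find()) {
            if (matcher.end() >= min) {
                best = matcher.end(); // later matches overwrite earlier ones
            }
        }
        return best;
    }
}
```

A caller could use the returned offset directly as the substring cut point and fall back to a second-choice pattern (commas, spaces) when it gets -1.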

I've tried various ways to apply MIN_UTTERANCE_LENGTH and MAX_UTTERANCE_LENGTH to the Pattern so that it only matches between these values, and to use lookbehinds (?<=) to search in reverse, but this is where my knowledge starts to fail me:

private static final String poorEffort = "([.!?]{200, 298})\\s+";

Finally

Can any of you regex masters achieve what I'm after, and confirm whether it would in fact prove more efficient?

I thank you in advance.

References:

Split a string at a natural break (Python)
Lookarounds
Regex to Split Tokens With Minimum Size and Delimiters

asked Apr 29 '14 00:04

brandall

1 Answer

I would do something like this:

Pattern p = Pattern.compile(".{1,299}(?:[.!?]\\s+|\\n|$)", Pattern.DOTALL);
Matcher matcher = p.matcher(text);
while (matcher.find()) {
    speakableUtterances.add(matcher.group().trim());
}

Explanation of the regex:

.{1,299}                 any character, 1 to 299 times (greedy: as many as possible)
(?:[.!?]\\s+|\\n|$)      followed by one of . ! ? plus whitespace, or a newline, or the end of the string

You could consider extending the punctuation to \p{Punct}; see the Javadoc for Pattern.

You can see a working sample on ideone.
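
One thing to watch with a find() loop like this: when a 299-character window contains none of the delimiters, find() moves on to a later start position, so the skipped characters never reach the list. A possible way around that is to add fallback alternatives to the same pattern. The sketch below is my own extension under that assumption, not part of the answer as posted, and the class name RegexUtteranceSplitter is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: regex splitting with whitespace and hard-cut fallbacks.
final class RegexUtteranceSplitter {

    // Alternatives are tried in order at each position:
    // 1. up to 299 chars ending at . ! ? plus whitespace, a newline, or end of input
    // 2. up to 299 chars ending at whitespace (covers ", " and plain spaces)
    // 3. a hard cut of up to 300 chars, so no text is skipped
    private static final Pattern CHUNK = Pattern.compile(
            ".{1,299}(?:[.!?]\\s+|\\n|$)|.{1,299}\\s+|.{1,300}",
            Pattern.DOTALL);

    static List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        Matcher matcher = CHUNK.matcher(text);
        while (matcher.find()) {
            String chunk = matcher.group().trim();
            if (!chunk.isEmpty()) { // ignore whitespace-only matches
                parts.add(chunk);
            }
        }
        return parts;
    }
}
```

Unlike the question's code, this has no minimum length, so an early full stop can still produce a short chunk. Whether it is faster than the lastIndexOf version would need measuring on real input.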

answered Oct 16 '22 08:10

morja
