
Split String at natural language breaks

Tags: java, string, regex

Overview

I send Strings to a Text-to-Speech server that accepts a maximum length of 300 characters. Due to network latency, there may be a delay between each section of speech being returned, so I'd like to break the speech up at the most 'natural pauses' wherever possible.

Each server request costs me money, so ideally I'd send the longest string possible, up to the maximum allowed characters.

Here is my current implementation:

import java.util.ArrayList;
import java.util.concurrent.TimeUnit;

private static final boolean DEBUG = true;

private static final int MAX_UTTERANCE_LENGTH = 298;
private static final int MIN_UTTERANCE_LENGTH = 200;

private static final String FULL_STOP_SPACE = ". ";
private static final String QUESTION_MARK_SPACE = "? ";
private static final String EXCLAMATION_MARK_SPACE = "! ";
private static final String LINE_SEPARATOR = System.getProperty("line.separator");
private static final String COMMA_SPACE = ", ";
private static final String JUST_A_SPACE = " ";

public static ArrayList<String> splitUtteranceNaturalBreaks(String utterance) {

    final long then = System.nanoTime();

    final ArrayList<String> speakableUtterances = new ArrayList<String>();

    int splitLocation = 0;
    String success = null;

    while (utterance.length() > MAX_UTTERANCE_LENGTH) {

        splitLocation = utterance.lastIndexOf(FULL_STOP_SPACE, MAX_UTTERANCE_LENGTH);

        if (DEBUG) {
            System.out.println("(0 FULL STOP) - last index at: " + splitLocation);
        }

        if (splitLocation < MIN_UTTERANCE_LENGTH) {
            if (DEBUG) {
                System.out.println("(1 FULL STOP) - NOT_OK");
            }

            splitLocation = utterance.lastIndexOf(QUESTION_MARK_SPACE, MAX_UTTERANCE_LENGTH);

            if (DEBUG) {
                System.out.println("(1 QUESTION MARK) - last index at: " + splitLocation);
            }

            if (splitLocation < MIN_UTTERANCE_LENGTH) {
                if (DEBUG) {
                    System.out.println("(2 QUESTION MARK) - NOT_OK");
                }

                splitLocation = utterance.lastIndexOf(EXCLAMATION_MARK_SPACE, MAX_UTTERANCE_LENGTH);

                if (DEBUG) {
                    System.out.println("(2 EXCLAMATION MARK) - last index at: " + splitLocation);
                }

                if (splitLocation < MIN_UTTERANCE_LENGTH) {
                    if (DEBUG) {
                        System.out.println("(3 EXCLAMATION MARK) - NOT_OK");
                    }

                    splitLocation = utterance.lastIndexOf(LINE_SEPARATOR, MAX_UTTERANCE_LENGTH);

                    if (DEBUG) {
                        System.out.println("(3 SEPARATOR) - last index at: " + splitLocation);
                    }

                    if (splitLocation < MIN_UTTERANCE_LENGTH) {
                        if (DEBUG) {
                            System.out.println("(4 SEPARATOR) - NOT_OK");
                        }

                        splitLocation = utterance.lastIndexOf(COMMA_SPACE, MAX_UTTERANCE_LENGTH);

                        if (DEBUG) {
                            System.out.println("(4 COMMA) - last index at: " + splitLocation);
                        }

                        if (splitLocation < MIN_UTTERANCE_LENGTH) {
                            if (DEBUG) {
                                System.out.println("(5 COMMA) - NOT_OK");
                            }

                            splitLocation = utterance.lastIndexOf(JUST_A_SPACE, MAX_UTTERANCE_LENGTH);

                            if (DEBUG) {
                                System.out.println("(5 SPACE) - last index at: " + splitLocation);
                            }

                            if (splitLocation < MIN_UTTERANCE_LENGTH) {
                                if (DEBUG) {
                                    System.out.println("(6 SPACE) - NOT_OK");
                                }

                                splitLocation = MAX_UTTERANCE_LENGTH;

                                if (DEBUG) {
                                    System.out.println("(6 MAX_UTTERANCE_LENGTH) - last index at: " + splitLocation);
                                }

                            } else {
                                if (DEBUG) {
                                    System.out.println("Accepted");
                                }

                                splitLocation -= 1;
                            }
                        }
                    } else {
                        if (DEBUG) {
                            System.out.println("Accepted");
                        }

                        splitLocation -= 1;
                    }
                } else {
                    if (DEBUG) {
                        System.out.println("Accepted");
                    }
                }
            } else {
                if (DEBUG) {
                    System.out.println("Accepted");
                }
            }
        } else {
            if (DEBUG) {
                System.out.println("Accepted");
            }
        }

        success = utterance.substring(0, (splitLocation + 2));

        speakableUtterances.add(success.trim());

        if (DEBUG) {
            System.out.println("Split - Length: " + success.length() + " -:- " + success);
            System.out.println("------------------------------");
        }

        utterance = utterance.substring((splitLocation + 2)).trim();
    }

    speakableUtterances.add(utterance);

    if (DEBUG) {

        System.out.println("Split - Length: " + utterance.length() + " -:- " + utterance);

        final long now = System.nanoTime();
        final long elapsed = now - then;

        System.out.println("ELAPSED: " + TimeUnit.MILLISECONDS.convert(elapsed, TimeUnit.NANOSECONDS));

    }

    return speakableUtterances;
}

It's ugly because lastIndexOf doesn't accept a regex, so every delimiter needs its own lookup. Ugliness aside, it's actually pretty fast.
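
For comparison, the nested fallbacks above can be flattened without any regex by looping over the delimiters in order of preference. This is only a sketch under a few assumptions: the class name PrioritySplitter, the DELIMITERS array and the findCut helper are made up for illustration, "\n" stands in for the platform line separator, and the limit is written as 300 because the delimiter's length is subtracted in the lookup.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original code: same fallback order,
// expressed as a loop instead of nested ifs.
final class PrioritySplitter {

    static final int MAX_LENGTH = 300; // server limit stated in the question
    static final int MIN_LENGTH = 200; // shortest chunk worth accepting

    // Delimiters in order of preference; a single space is the last resort.
    // Assumption: "\n" is used instead of System.getProperty("line.separator").
    static final String[] DELIMITERS = {". ", "? ", "! ", "\n", ", ", " "};

    static List<String> split(String utterance) {
        List<String> chunks = new ArrayList<>();
        String remaining = utterance.trim();
        while (remaining.length() > MAX_LENGTH) {
            int cut = findCut(remaining);
            chunks.add(remaining.substring(0, cut).trim());
            remaining = remaining.substring(cut).trim();
        }
        if (!remaining.isEmpty()) {
            chunks.add(remaining);
        }
        return chunks;
    }

    // Returns the index just past the best delimiter, or MAX_LENGTH if none qualifies.
    static int findCut(String text) {
        for (String delimiter : DELIMITERS) {
            // Latest start position at which the delimiter still ends within the limit.
            int index = text.lastIndexOf(delimiter, MAX_LENGTH - delimiter.length());
            int cut = index + delimiter.length();
            if (index >= 0 && cut >= MIN_LENGTH) {
                return cut;
            }
        }
        return MAX_LENGTH; // no usable delimiter: hard cut at the limit
    }
}
```

Calling PrioritySplitter.split(utterance) returns the chunks in order. Adding or reordering a delimiter then means editing one array rather than another level of nesting.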

Problems

Ideally I'd like to use regex that allows for a match on one of my first choice delimiters:

private static final String firstChoice = "[.!?" + LINE_SEPARATOR + "]\\s+";
private static final Pattern pFirstChoice = Pattern.compile(firstChoice);

And then use a matcher to resolve the position:

    Matcher matcher = pFirstChoice.matcher(input);

    if (matcher.find()) {
        splitLocation = matcher.start();
    }

The alternative to my current implementation is to store the location of each delimiter and then select the one nearest to MAX_UTTERANCE_LENGTH.
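
To make the "store each delimiter and take the nearest one" idea concrete, here is a rough sketch that restricts a Matcher to the window between the minimum and maximum length and keeps the last match it sees. The class name RegexCutFinder, the lastCut method and the simplified pattern are assumptions for illustration, not part of my code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds the last first-choice delimiter inside a length window.
final class RegexCutFinder {

    // Sentence punctuation followed by whitespace, or a bare newline.
    // Assumption: "\n" stands in for the platform line separator.
    static final Pattern FIRST_CHOICE = Pattern.compile("[.!?](?=\\s)|\\n");

    // Returns the index just past the last delimiter that ends within [min, max],
    // or -1 if there is none.
    static int lastCut(String text, int min, int max) {
        int end = Math.min(max, text.length());
        // Every match is a single character, so one ending at >= min starts at >= min - 1.
        int start = Math.min(Math.max(0, min - 1), end);
        Matcher m = FIRST_CHOICE.matcher(text)
                .region(start, end)
                // Lets the lookahead see the whitespace just past the region's end.
                .useTransparentBounds(true);
        int cut = -1;
        while (m.find()) {
            cut = m.end(); // later matches overwrite earlier ones
        }
        return cut;
    }
}
```

With a result of -1 the caller would fall back to the next choice (comma, then space, then a hard cut), and otherwise take text.substring(0, cut) as the next utterance.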

I've tried various ways to apply MIN_UTTERANCE_LENGTH and MAX_UTTERANCE_LENGTH to the Pattern so that it only matches between these values, and to use a lookbehind (?<=) to search backwards, but this is where my knowledge starts to fail me:

private static final String poorEffort = "([.!?]{200, 298})\\s+";

Finally

Can any of you regex masters achieve what I'm after, and confirm whether it would in fact prove more efficient?

I thank you in advance.

References:

  • Split a string at a natural break (Python)
  • Lookarounds
  • Regex to Split Tokens With Minimum Size and Delimiters

asked Apr 29 '14 by brandall



1 Answer

I would do something like this:

Pattern p = Pattern.compile(".{1,299}(?:[.!?]\\s+|\\n|$)", Pattern.DOTALL);
Matcher matcher = p.matcher(text);
while (matcher.find()) {
    speakableUtterances.add(matcher.group().trim());
}

Explanation of the regex:

.{1,299}                 any character, 1 to 299 times (greedy, so as many as possible)
(?:[.!?]\\s+|\\n|$)      followed by one of .!? and whitespace, or a newline, or the end of the string

You could consider extending the punctuation to \p{Punct}; see the Javadoc for Pattern.

You can see a working sample on ideone.
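
One thing to watch with the pattern above: if a stretch of text has no sentence delimiter within 299 characters, find() moves its start position forward until a match becomes possible, so the skipped characters never reach the list. A possible extension, sketched here with hypothetical names (RegexUtteranceSplitter, CHUNK) and with the question's 200 and 300 limits hard-coded, adds lower-priority alternatives so that every character ends up in some chunk. It makes no claim about speed relative to lastIndexOf; that would need measuring.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical extension of the pattern above, with fallbacks and a minimum length.
final class RegexUtteranceSplitter {

    // Alternatives are tried left to right at each position, so earlier ones win.
    static final Pattern CHUNK = Pattern.compile(
            "\\s*+("                        // skip leading whitespace, then capture the chunk
            + ".{1,300}\\z"                 // the rest of the text fits in one request
            + "|.{199,299}[.!?](?=\\s)"     // 200 to 300 chars ending at sentence punctuation
            + "|.{200,300}(?=\\n)"          // ... or just before a newline
            + "|.{199,299},(?=\\s)"         // ... or at a comma
            + "|.{200,300}(?=\\s)"          // ... or at any word boundary
            + "|.{1,300}"                   // last resort: hard cut
            + ")",
            Pattern.DOTALL);

    static List<String> split(String text) {
        List<String> chunks = new ArrayList<>();
        Matcher m = CHUNK.matcher(text);
        while (m.find()) {
            chunks.add(m.group(1).trim());
        }
        return chunks;
    }
}
```

Because each greedy quantifier starts from its upper bound and backtracks, the first alternative that matches yields the longest chunk for that delimiter class, which is the behaviour the question asks for.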

answered Oct 16 '22 by morja