What is the right way to split a string into words ? (string doesn't contain any spaces or punctuation marks)
For example: "stringintowords" -> "String Into Words"
Could you please advise what algorithm should be used here ?
! Update: For those who think this question is just for curiosity. This algorithm could be used to camеlcase domain names ("sportandfishing .com" -> "SportAndFishing .com") and this algo is currently used by aboutus dot org to do this conversion dynamically.
The split() method splits a string into an array of substrings. The split() method returns the new array. The split() method does not change the original string. If (" ") is used as separator, the string is split between words.
To split a string by a regular expression, pass a regex as a parameter to the split() method, e.g. str. split(/[,. \s]/) . The split method takes a string or regular expression and splits the string based on the provided separator, into an array of substrings.
Python 3 - String split() MethodThe split() method returns a list of all the words in the string, using str as the separator (splits on all whitespace if left unspecified), optionally limiting the number of splits to num.
Let's assume that you have a function isWord(w)
, which checks if w
is a word using a dictionary. Let's for simplicity also assume for now that you only want to know whether for some word w
such a splitting is possible. This can be easily done with dynamic programming.
Let S[1..length(w)]
be a table with Boolean entries. S[i]
is true if the word w[1..i]
can be split. Then set S[1] = isWord(w[1])
and for i=2
to length(w)
calculate
S[i] = (isWord[w[1..i] or for any j in {2..i}: S[j-1] and isWord[j..i]).
This takes O(length(w)^2) time, if dictionary queries are constant time. To actually find the splitting, just store the winning split in each S[i] that is set to true. This can also be adapted to enumerate all solution by storing all such splits.
As mentioned by many people here, this is a standard, easy dynamic programming problem: the best solution is given by Falk Hüffner. Additional info though:
(a) you should consider implementing isWord with a trie, which will save you a lot of time if you use properly (that is by incrementally testing for words).
(b) typing "segmentation dynamic programming" yields a score of more detail answers, from university level lectures with pseudo-code algorithm, such as this lecture at Duke's (which even goes so far as to provide a simple probabilistic approach to deal with what to do when you have words that won't be contained in any dictionary).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With