
Split string into sentences in javascript

I am currently working on an application that splits a long column of text into shorter ones. To do that, I split the entire text into sentences, but at the moment my regex also splits inside numbers.

What I do is this:

str = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";
sentences = str.replace(/\.+/g,'.|').replace(/\?/g,'?|').replace(/\!/g,'!|').split("|");

The result is:

Array [
    "This is a long string with some numbers [125.",
    "000,55 and 140.",
    "000] and an end.",
    " This is another sentence."
]

The desired result would be:

Array [
    "This is a long string with some numbers [125.000,55 and 140.000] and an end.",
    "This is another sentence."
]

How do I have to change my regex to achieve this? Are there any pitfalls I should watch out for? Or would it be good enough to split on ". ", "? " and "! "?

Tobias Golbs asked Sep 20 '13 10:09



2 Answers

str.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|") 

Output:

[ 'This is a long string with some numbers [125.000,55 and 140.000] and an end.',   'This is another sentence.' ] 

Breakdown:

([.?!]) = Capture either . or ? or !

\s* = Match 0 or more whitespace characters following the previous token ([.?!]). This accounts for the spaces that normally follow a punctuation mark in English. The whitespace sits outside the capturing group, so the replacement drops it.

(?=[A-Z]) = A lookahead: the previous tokens only match if the next character is within the range A-Z (capital A to capital Z). The letter itself is not consumed. Most English language sentences start with a capital letter. None of the previous regexes take this into account.


The replace operation uses:

"$1|" 

We used one capturing group, ([.?!]), which captures one of those characters. The whole match (the punctuation mark plus any whitespace after it) is replaced with $1 (the captured character) plus |. So if we captured ? then the replacement would be ?|.

Finally, we split on the pipes | and get our result.
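The pipe placeholder breaks down if the text itself contains a | character. As a sketch of an alternative (not part of the original answer), the same rule can be passed straight to split() using a lookbehind, assuming a JavaScript engine that supports lookbehind assertions (current Node.js and modern browsers do):

```javascript
// Sample text from the question.
const str = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";

// Split at the position right after ., ? or ! (lookbehind),
// dropping any whitespace there, but only when a capital A-Z follows (lookahead).
const sentences = str.split(/(?<=[.?!])\s*(?=[A-Z])/);

console.log(sentences);
// [
//   'This is a long string with some numbers [125.000,55 and 140.000] and an end.',
//   'This is another sentence.'
// ]
```

Because no placeholder character is inserted, a literal | in the input stays inside its sentence.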


So, essentially, what we are saying is this:

1) Find punctuation marks (one of . or ? or !) and capture them

2) Punctuation marks can optionally be followed by spaces.

3) After a punctuation mark, I expect a capital letter.

Unlike the previous regular expressions provided, this follows the usual English pattern of punctuation mark, optional space, capital letter, so periods inside numbers are left alone.

From there:

4) We replace each match with the captured punctuation mark plus a pipe |

5) We split on the pipes to create an array of sentences.
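The five steps above can be wrapped in a small function. This is a minimal sketch: the name splitSentences is made up for illustration, and the regex is the one from this answer. The last two calls show cases the capital-letter rule gets wrong.

```javascript
// Hypothetical helper name; the regex is the one from this answer.
function splitSentences(text) {
  return text
    // Steps 1-3: capture ., ? or !, swallow trailing whitespace,
    // and require a capital A-Z next. Step 4: append a pipe.
    .replace(/([.?!])\s*(?=[A-Z])/g, "$1|")
    // Step 5: split on the pipes.
    .split("|");
}

console.log(splitSentences("Is it late? Yes! Go home."));
// [ 'Is it late?', 'Yes!', 'Go home.' ]

// A capital letter after an abbreviation triggers an unwanted split:
console.log(splitSentences("Mr. Smith left."));
// [ 'Mr.', 'Smith left.' ]

// A sentence starting with a lowercase letter is not split off:
console.log(splitSentences("It works. iPhone users agree."));
// [ 'It works. iPhone users agree.' ]
```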

JSPP answered Oct 30 '22 11:10


str.replace(/(\.+|\:|\!|\?)(\"*|\'*|\)*|}*|]*)(\s|\n|\r|\r\n)/gm, "$1$2|").split("|") 

The RegExp (see on Debuggex):

  • (\.+|\:|\!|\?) = A sentence can end not only with ".", "!" or "?", but also with "..." or ":"
  • (\"*|\'*|\)*|}*|]*) = The sentence can be enclosed in quotation marks, parentheses, braces or brackets
  • (\s|\n|\r|\r\n) = A sentence must be followed by a space or a line break
  • g = global
  • m = multiline
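To see how these pieces behave together, here is a small sketch that runs the expression on the question's sample text and on two extra inputs. The names SENTENCE_END and splitOnWhitespace are made up for illustration; the regex is copied unchanged from this answer.

```javascript
// Regex copied from the answer; the constant and helper names are hypothetical.
const SENTENCE_END = /(\.+|\:|\!|\?)(\"*|\'*|\)*|}*|]*)(\s|\n|\r|\r\n)/gm;

function splitOnWhitespace(text) {
  // $1 = end punctuation, $2 = optional closing quotes or brackets.
  // The whitespace matched by group 3 is replaced by the pipe.
  return text.replace(SENTENCE_END, "$1$2|").split("|");
}

const str = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";
console.log(splitOnWhitespace(str));
// [
//   'This is a long string with some numbers [125.000,55 and 140.000] and an end.',
//   'This is another sentence.'
// ]

console.log(splitOnWhitespace('He said "Stop!" Then he left.'));
// [ 'He said "Stop!"', 'Then he left.' ]

// Because ":" counts as a sentence end, this also splits:
console.log(splitOnWhitespace("Note: a colon also splits."));
// [ 'Note:', 'a colon also splits.' ]
```

The periods inside the numbers are left alone because no whitespace follows them.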

Remarks:

  • If you use (?=[A-Z]), the RegExp will not work correctly in some languages. E.g. "Ü", "Č" or "Á" will not be recognised.
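
If the capital-letter check is still wanted for such languages, one possible workaround (a sketch, not part of either answer) is a Unicode property escape. It assumes an engine that supports \p{...} with the u flag, which current Node.js and modern browsers do; the name splitUnicode is made up for illustration.

```javascript
// \p{Lu} matches any uppercase letter in Unicode; it requires the u flag.
// Hypothetical helper name, same replace-and-split approach as above.
const splitUnicode = (text) =>
  text.replace(/([.?!])\s*(?=\p{Lu})/gu, "$1|").split("|");

console.log(splitUnicode("Das ist gut. Über alles. Čau!"));
// [ 'Das ist gut.', 'Über alles.', 'Čau!' ]
```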

Antonín Slejška answered Oct 30 '22 10:10