Split string into sentences in javascript

Currently I am working on an application that splits a long column into short ones. For that I split the entire text into sentences, but at the moment my regex splits numbers too.

What I do is this:

str = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";
sentences = str.replace(/\.+/g,'.|').replace(/\?/g,'?|').replace(/\!/g,'!|').split("|");

The result is:

Array [
    "This is a long string with some numbers [125.",
    "000,55 and 140.",
    "000] and an end.",
    " This is another sentence."
]

The desired result would be:

Array [
    "This is a long string with some numbers [125.000,55 and 140.000] and an end.",
    "This is another sentence."
]

How do I have to change my regex to achieve this? Are there problems I should watch out for? Or would it be good enough to search for ". ", "? " and "! "?

asked Sep 20 '13 10:09

Tobias Golbs

2 Answers

str.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|")

Output:

[ 'This is a long string with some numbers [125.000,55 and 140.000] and an end.',
  'This is another sentence.' ]

Breakdown:

([.?!]) = Capture either . or ? or !

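\s* = Match 0 or more whitespace characters following the previous token ([.?!]). This accounts for the spaces that normally follow a punctuation mark in English. Because this whitespace is matched outside the capturing group, it is not part of $1 and is dropped by the replacement.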

(?=[A-Z]) = A lookahead: the previous tokens only match if the next character is within the range A-Z (capital A to capital Z). The lookahead does not consume that character, so it stays at the start of the next sentence. Most English sentences start with a capital letter. None of the previous regexes take this into account.

The replace operation uses:

"$1|"

We used one "capturing group", ([.?!]), which captures one of those characters. The whole match (the punctuation mark plus any whitespace after it) is replaced with $1 (the contents of the first capturing group) plus |. So if we captured ? then the replacement would be ?|.

Finally, we split on the pipes | and get our result.
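The two steps can be made visible by keeping the intermediate string. This is only a minimal sketch using the sample text from the question; the variable names text, marked and sentences are chosen here for illustration.

```javascript
// Sample text from the question.
const text = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";

// Step 1: mark each sentence boundary with a pipe.
// The whitespace matched by \s* is outside the capturing group, so it is dropped.
const marked = text.replace(/([.?!])\s*(?=[A-Z])/g, "$1|");

// Step 2: split on the marker.
const sentences = marked.split("|");

console.log(marked);    // the text with "|" inserted at the single boundary
console.log(sentences); // an array with two sentences
```

The dots inside 125.000 and 140.000 are left alone because the character after them is a digit, not a capital letter.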

So, essentially, what we are saying is this:

1) Find punctuation marks (one of . or ? or !) and capture them

2) Punctuation marks may optionally be followed by whitespace.

3) After a punctuation mark, I expect a capital letter.

Unlike the previous regular expressions provided, this follows the usual English convention that a new sentence starts with a capital letter.

From there:

4) We replace each match with the captured punctuation mark followed by a pipe |

5) We split on the pipes to create an array of sentences.
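Since the question also asks which problems to watch out for, here is a small sketch of three cases where the steps above behave differently from what one might expect. The helper name splitSentences is hypothetical and simply wraps the expression from this answer; the sample strings are made up for illustration.

```javascript
// Hypothetical helper wrapping the replace-and-split approach described above.
function splitSentences(text) {
  return text.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|");
}

// 1) An abbreviation followed by a capitalized word is treated as a sentence end.
const abbreviation = splitSentences("Mr. Smith arrived. He sat down.");
console.log(abbreviation); // [ 'Mr.', 'Smith arrived.', 'He sat down.' ]

// 2) A sentence that starts with a digit is not split off.
const digitStart = splitSentences("It ended. 42 people left.");
console.log(digitStart); // [ 'It ended. 42 people left.' ]

// 3) A pipe that is already in the text produces an extra split.
const existingPipe = splitSentences("Use a|b here. Then stop.");
console.log(existingPipe); // [ 'Use a', 'b here.', 'Then stop.' ]
```

If the input may contain a pipe, a marker character that cannot occur in the text would be a safer choice.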

answered Oct 30 '22 11:10

JSPP

str.replace(/(\.+|\:|\!|\?)(\"*|\'*|\)*|}*|]*)(\s|\n|\r|\r\n)/gm, "$1$2|").split("|")

The RegExp (see on Debuggex):

(\.+|\:|\!|\?) = A sentence can end not only with ".", "!" or "?", but also with "..." or ":"
(\"*|\'*|\)*|}*|]*) = The end of a sentence can be followed by closing quotation marks, parentheses, braces or brackets
(\s|\n|\r|\r\n) = A sentence must be followed by a space or a line break
g = global
m = multiline (it has no effect on this pattern, which contains no ^ or $ anchors)
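To see how the three groups interact, the expression can be tried on a text with quotation marks, an ellipsis and parentheses. This is only a sketch: the helper name splitWithClosers and the sample strings are made up for illustration, and the regex is copied unchanged from this answer.

```javascript
// Hypothetical helper wrapping the expression from this answer.
function splitWithClosers(text) {
  return text
    .replace(/(\.+|\:|\!|\?)(\"*|\'*|\)*|}*|]*)(\s|\n|\r|\r\n)/gm, "$1$2|")
    .split("|");
}

// Closing quotes and parentheses stay attached to the sentence they end.
const mixed = splitWithClosers('He said "Stop!" Then he left... Nobody followed. (Or so he thought.) The end.');
console.log(mixed);
// [ 'He said "Stop!"', 'Then he left...', 'Nobody followed.', '(Or so he thought.)', 'The end.' ]

// The numbers from the question are not split, because no whitespace follows their dots.
const numbers = splitWithClosers("This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.");
console.log(numbers.length); // 2

// A colon followed by a space also counts as a sentence end.
const colon = splitWithClosers("Remember: colons split too.");
console.log(colon); // [ 'Remember:', 'colons split too.' ]
```

Whether a split after a colon is wanted depends on the text, so that alternative may need to be removed for some inputs.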

Remarks:

If you use (?=[A-Z]), the RegExp will not work correctly in some languages, because capital letters outside A-Z such as "Ü", "Č" or "Á" will not be recognised.
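One possible way to keep the capital-letter check while covering such letters is a Unicode property escape. The sketch below is not part of either answer: it assumes a JavaScript engine that supports \p{...} with the u flag (ES2018 and later), and the helper name splitSentencesUnicode and the sample text are made up for illustration.

```javascript
// \p{Lu} matches any uppercase letter in Unicode; it requires the "u" flag.
function splitSentencesUnicode(text) {
  return text.replace(/([.?!])\s*(?=\p{Lu})/gu, "$1|").split("|");
}

// Sentences that start with capital letters outside the ASCII range A-Z.
const sample = "Das ist gut. Über uns steht mehr. Černá kočka spí. Ámen.";
const parts = splitSentencesUnicode(sample);

console.log(parts);
// [ 'Das ist gut.', 'Über uns steht mehr.', 'Černá kočka spí.', 'Ámen.' ]
```

With [A-Z] in place of \p{Lu}, none of the boundaries in this sample would be detected.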

answered Oct 30 '22 10:10

Antonín Slejška
