How to parse a string into words and punctuation marks using JavaScript

I have a string test = "hello how are you all doing, I hope that it's good! and fine. Looking forward to see you."

I am trying to parse the string into words and punctuation marks using JavaScript. I am able to separate the words, but the punctuation marks disappear when I use the regex

var result= test.match(/\b(\w|')+\b/g);

So my expected output is

hello
how 
are 
you
all
doing
,
I
hope
that
it's
good
!
and 
fine
.
Looking
forward
to
see
you
Imran Jawaid asked Mar 28 '26 05:03


2 Answers

Simple approach

This first approach works if your definition of a "word" matches JavaScript's. A more customizable approach is below.

Try test.split(/\s*\b\s*/). It splits on word boundaries (\b) and eats whitespace.

"hello how are you all doing, I hope that it's good! and fine. Looking forward to see you."
    .split(/\s*\b\s*/);
// Returns:
["hello",
"how",
"are",
"you",
"all",
"doing",
",",
"I",
"hope",
"that",
"it",
"'",
"s",
"good",
"!",
"and",
"fine",
".",
"Looking",
"forward",
"to",
"see",
"you",
"."]

How it works.

var test = "This is. A test?"; // Test string.

// First consider splitting on word boundaries (\b).
test.split(/\b/); //=> ["This"," ","is",". ","A"," ","test","?"]
// This almost works but there is some unwanted whitespace.

// So we change the split regex to gobble the whitespace using \s*
test.split(/\s*\b\s*/) //=> ["This","is",".","A","test","?"]
// Now the whitespace is included in the separator
// and not included in the result.
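
One caveat with the split approach: if the string begins with whitespace, that whitespace is itself a separator, so the result starts with an empty string. Here is a minimal sketch of a wrapper that guards against this. The name splitTokens is my own, not part of any API, and it assumes you are happy with JavaScript's ASCII-only definition of a word character.

```javascript
// Hypothetical helper (the name is illustrative): split on word boundaries,
// dropping the empty strings that surrounding whitespace or an empty input
// would otherwise produce.
function splitTokens(text) {
  return text
    .trim()                      // remove leading/trailing whitespace first
    .split(/\s*\b\s*/)           // same regex as in the answer above
    .filter(function (token) {   // "".split(...) gives [""], so drop empties
      return token.length > 0;
    });
}

// Without the wrapper, the leading space produces an empty first element.
var raw = " This is. A test? ".split(/\s*\b\s*/);
console.log(raw[0] === ""); // → true

console.log(splitTokens(" This is. A test? "));
// → ["This", "is", ".", "A", "test", "?"]
```

Trimming first is the cheap part of the fix; the filter is only there as a safety net for inputs such as the empty string.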

More involved solution.

If you want words like "isn't" and "one-thousand" to be treated as a single word, even though JavaScript's \b breaks them apart at the apostrophe and the hyphen, you will need to create your own definition of a word.

test.match(/[\w'-]+|[^\w\s]+/g) //=> ["This","is",".","A","test","?"]

How it works

This matches the words and the punctuation characters separately using an alternation. The first half of the regex, [\w'-]+, matches whatever you consider to be a word (the hyphen goes last in the class so that it is read as a literal hyphen rather than a range), and the second half, [^\w\s]+, matches whatever you consider punctuation. In this example I just used whatever isn't a word character or whitespace. I also put a + on the end so that multi-character punctuation (such as ?! which is properly written ‽) is treated as a single token; if you don't want that, remove the +.
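
To make the "own definition of a word" idea concrete, here is a hedged sketch with two variants. The function names are mine. The Unicode variant is an extension not covered above and assumes an engine that supports \p{...} property escapes with the u flag (modern browsers and Node.js do).

```javascript
// Hypothetical helpers; the names and the Unicode variant are illustrative.

// ASCII definition of a word: letters, digits, underscore, apostrophe, hyphen.
function matchTokens(text) {
  // match() with the g flag returns null when nothing matches,
  // so fall back to an empty array.
  return text.match(/[\w'-]+|[^\w\s]+/g) || [];
}

// Unicode-aware variant: \p{L} (letters) and \p{N} (numbers) replace \w,
// which only covers ASCII. Property escapes require the u flag.
function matchTokensUnicode(text) {
  return text.match(/[\p{L}\p{N}_'-]+|[^\p{L}\p{N}_\s]+/gu) || [];
}

console.log(matchTokens("It isn't one-thousand, is it?!"));
// → ["It", "isn't", "one-thousand", ",", "is", "it", "?!"]

console.log(matchTokensUnicode("caf\u00e9 d\u00e9j\u00e0-vu!"));
// → ["café", "déjà-vu", "!"]

console.log(matchTokens("   "));
// → []
```

The ASCII version would break the accented words in the second example into pieces, because \w does not match é or à.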

Kevin Cox answered Mar 29 '26 20:03


Use this:

[,.!?;:]|\b[a-z']+\b

See the matches in the demo.

For instance, in JavaScript:

var resultArray = yourString.match(/[,.!?;:]|\b[a-z']+\b/ig);

Explanation

  • The character class [,.!?;:] matches one punctuation character from inside the brackets
  • | is alternation (OR)
  • \b matches a word boundary
  • [a-z']+ matches one or more letters or apostrophes (the i flag makes the match case-insensitive)
  • \b matches a word boundary again
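
As a rough illustration of what this pattern keeps and what it drops, here is a small sketch. The wrapper name matchListedTokens is hypothetical. Note that only the six listed punctuation marks are captured, and digits are not part of [a-z'].

```javascript
// Hypothetical wrapper around the pattern above; the name is illustrative.
function matchListedTokens(text) {
  // match() returns null when there are no matches, so fall back to [].
  return text.match(/[,.!?;:]|\b[a-z']+\b/ig) || [];
}

console.log(matchListedTokens("it's 5 o'clock (really)!"));
// → ["it's", "o'clock", "really", "!"]
// The digit and the parentheses are skipped because the pattern lists neither.
```

If you need other characters in the output, add them to the first character class or to the word class.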
zx81 answered Mar 29 '26 20:03


