Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do you split a javascript string by spaces and punctuation?

I have some random string, for example: Hello, my name is john.. I want that string split into an array like this: Hello, ,, , my, name, is, john, .,. I tried str.split(/[^\w\s]|_/g), but it does not seem to work. Any ideas?

like image 961
chromedude Avatar asked May 28 '11 15:05

chromedude


People also ask

How do you split a string with spaces?

To split a string by multiple spaces, call the split() method, passing it a regular expression, e.g. str. trim(). split(/\s+/) . The regular expression will split the string on one or more spaces and return an array containing the substrings.

How do you separate text in JavaScript?

The split() method splits a string into an array of substrings. The split() method returns the new array. The split() method does not change the original string. If (" ") is used as separator, the string is split between words.

Can we divide string in JavaScript?

The split() Method in JavaScript. The split() method splits (divides) a string into two or more substrings depending on a splitter (or divider). The splitter can be a single character, another string, or a regular expression.

What is a separator in JavaScript?

separator: It is optional parameter. It defines the character or the regular expression to use for breaking the string. If not used, the same string is returned (single item array). limit: It is optional parameter.


2 Answers

This solution caused a challenge with spaces for me (still needed them), then I gave str.split(/\b/) a shot and all is well. Spaces are output in the array, which won't be hard to ignore, and the ones left after punctuation can be trimmed out.

like image 44
MikeyB Avatar answered Oct 17 '22 22:10

MikeyB


To split a str on any run of non-word characters I.e. Not A-Z, 0-9, and underscore.

var words=str.split(/\W+/);  // assumes str does not begin nor end with whitespace

Or, assuming your target language is English, you can extract all semantically useful values from a string (i.e. "tokenizing" a string) using:

var str='Here\'s a (good, bad, indifferent, ...) '+
        'example sentence to be used in this test '+
        'of English language "token-extraction".',

    punct='\\['+ '\\!'+ '\\"'+ '\\#'+ '\\$'+   // since javascript does not
          '\\%'+ '\\&'+ '\\\''+ '\\('+ '\\)'+  // support POSIX character
          '\\*'+ '\\+'+ '\\,'+ '\\\\'+ '\\-'+  // classes, we'll need our
          '\\.'+ '\\/'+ '\\:'+ '\\;'+ '\\<'+   // own version of [:punct:]
          '\\='+ '\\>'+ '\\?'+ '\\@'+ '\\['+
          '\\]'+ '\\^'+ '\\_'+ '\\`'+ '\\{'+
          '\\|'+ '\\}'+ '\\~'+ '\\]',

    re=new RegExp(     // tokenizer
       '\\s*'+            // discard possible leading whitespace
       '('+               // start capture group
         '\\.{3}'+            // ellipsis (must appear before punct)
       '|'+               // alternator
         '\\w+\\-\\w+'+       // hyphenated words (must appear before punct)
       '|'+               // alternator
         '\\w+\'(?:\\w+)?'+   // compound words (must appear before punct)
       '|'+               // alternator
         '\\w+'+              // other words
       '|'+               // alternator
         '['+punct+']'+        // punct
       ')'                // end capture group
     );

// grep(ary[,filt]) - filters an array
//   note: could use jQuery.grep() instead
// @param {Array}    ary    array of members to filter
// @param {Function} filt   function to test truthiness of member,
//   if omitted, "function(member){ if(member) return member; }" is assumed
// @returns {Array}  all members of ary where result of filter is truthy
function grep(ary,filt) {
  var result=[];
  for(var i=0,len=ary.length;i++<len;) {
    var member=ary[i]||'';
    if(filt && (typeof filt === 'Function') ? filt(member) : member) {
      result.push(member);
    }
  }
  return result;
}

var tokens=grep( str.split(re) );   // note: filter function omitted 
                                     //       since all we need to test 
                                     //       for is truthiness

which produces:


tokens=[ 
  'Here\'s',
  'a',
  '(',
  'good',
  ',',
  'bad',
  ',',
  'indifferent',
  ',',
  '...',
  ')',
  'example',
  'sentence',
  'to',
  'be',
  'used',
  'in',
  'this',
  'test',
  'of',
  'English',
  'language',
  '"',
  'token-extraction',
  '"',
  '.'
]

EDIT

Also available as a Github Gist

like image 88
Rob Raisch Avatar answered Oct 17 '22 22:10

Rob Raisch