Regex - Split string into array based on punctuation/spaces

Tags: regex, php

I need a way to split a string into parts based on the presence of punctuation marks or spaces.

Every word should become its own array element, and punctuation at the start or end of a word should also go into its own array element.

E.g. I need to be able to turn the string Hello, Harry Potter. I'm Tom Riddle. into

array(
    "Hello",
    ", ",
    "Harry",
    "Potter",
    ". ",
    "I'm",
    "Tom",
    "Riddle",
    ". "
)

So punctuation in the middle of words (e.g. an apostrophe) should not cause a separation.

Edit: to clarify the desired behaviour, I'm, didn't, etc. should remain one word, but hello!, "okay, etc. should be separated from the punctuation mark at the start or end.

Also, the punctuation marks which I want to be included in the search are:

  • . (full stop/period)
  • ? (question mark)
  • ! (exclamation mark)
  • , (comma)
  • ; (semi-colon)
  • : (colon)
  • - (hyphen/dash)
  • ( (start bracket)
  • ) (end bracket)
  • { (start squiggly brace)
  • } (end squiggly brace)
  • [ (start square bracket)
  • ] (end square bracket)
  • ' (single quotation mark)
  • " (double quotation mark)
  • … (ellipsis)

The closest I have found to the result I need is this:

preg_split('/(\s|[\.,\/])/', $string, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

However, the problems with this are:

  • Punctuation mid-word counts as normal punctuation
  • The array element containing the punctuation mark does not contain the adjacent space as well. Edit: Sorry for the vagueness; I meant that I want each punctuation element to include the space that comes after or before the punctuation mark. E.g. a comma would be , (space after), but an opening bracket would be ( (space before).
  • When I add the rest of the punctuation marks I need (preg_split("/(\s|[\.?!,;:-(){}[]'\"…\/])/",) I get an error. I'm pretty sure the error is due to an unescaped character, so I ran the whole list through preg_quote, which returned \.\?\!,;\:\-\(\)\{\}\[\]'"…, but this still gives the error: Parse error: syntax error, unexpected '…' (T_STRING), expecting ',' or ')' in [...][...] on line 5

My understanding of regex is fairly limited, but after looking at the PHP docs I gather that the code above splits the string at each whitespace character, full stop, comma, or forward slash. (Correct me if I'm wrong.) As I understand it, adding the rest of the characters inside the square brackets should make it split at any of those characters as well. Since this isn't working, I suppose I have some fundamental misunderstanding of how this works, so an explanation would be greatly appreciated.

M. Salman Khan asked May 09 '17 17:05


2 Answers

This will do it. The output differs slightly from yours because you listed ' as a character to split on, so I'm will be split:

// A limit of -1 means "no limit"; passing null here is deprecated as of PHP 8.1.
$result = preg_split('/(\.\.\.\s?|[-.?!,;:(){}\[\]\'"]\s?)|\s/',
                     $string, -1, PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);

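It could probably be simplified, but the pattern matches the ellipsis ... with an optional space, OR any of your other characters with an optional space, OR a single whitespace character. Only the first two alternatives sit inside the capture group, so bare whitespace splits the string without appearing in the result.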

You need to escape the dots . outside the character class [], escape [ and ] inside the character class, and either escape - or put it first or last so that it does not denote a range. You also need to escape the quote character that delimits the PHP string containing the pattern, in this case the single quote '.
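As an illustration of those escaping rules, here is a minimal sketch that puts the asker's whole punctuation list into one character class. It is not the pattern from this answer: the u flag is an added assumption (so that the UTF-8 ellipsis … is treated as one character), and the sample strings are made up.

```php
<?php
// Sketch only: the asker's punctuation list as a single character class.
// PHP level: the string is single-quoted, so only ' (and \) need a backslash.
// Regex level: [ and ] are escaped inside the class, and - sits last so it is literal.
// Assumption: the u flag is added so the UTF-8 ellipsis counts as one character.
$pattern = '/([,.?!;:(){}\[\]\'"…-])/u';

$flags = PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY;
$brackets = preg_split($pattern, 'Wait (really)?', -1, $flags);
$ellipsis = preg_split($pattern, 'Hm…', -1, $flags);

print_r($brackets); // "Wait ", "(", "really", ")", "?"
print_r($ellipsis); // "Hm", "…"
```

Whitespace is not part of this class, so the space after Wait stays attached to the word; the pattern in this answer handles that separately with \s.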

You didn't specify whether a space is required on either side of the punctuation, and it isn't clear whether "Punctuation mid-word counts as normal punctuation" means it should or shouldn't count.
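On the error quoted in the question: it is a PHP parse error rather than a regex error, which would be consistent with a quote character in the pasted preg_quote output ending the string literal early. A minimal sketch of one way around that, assuming the goal is simply to split on every listed character, is to call preg_quote at runtime and concatenate the result. The variable names, the u flag, and the sample string are assumptions, not part of the question.

```php
<?php
// Sketch only: build the character class at runtime instead of pasting
// preg_quote() output into a string literal.
// $chars is the asker's list; the single quote is escaped for PHP, not for the regex.
$chars = '.?!,;:-(){}[]\'"…';

// The second argument makes preg_quote() escape the "/" delimiter as well.
$class = preg_quote($chars, '/');

// Assumption: the u flag is added so the UTF-8 ellipsis is one character.
$pattern = '/(\s|[' . $class . '])/u';

$parts = preg_split($pattern, 'Oh, "no"!', -1,
                    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
print_r($parts); // "Oh", ",", " ", "\"", "no", "\"", "!"
```

Because \s is inside the capture group here, as in the question's original pattern, the space comes back as its own element.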

AbraCadaver answered Oct 27 '22 00:10


Do you really want all word-internal punctuation to stay attached? It also looks like you want to tokenize each punctuation character separately (but attach nearby whitespace), which is most of the work. If so, this should do it. It comes with a test string to show it at work.

$string = "Hello, it's me-me-it's-me!!! o... (a friend?)";
print_r( preg_split("/(\w\S+\w)|(\w+)|(\s*\.{3}\s*)|(\s*[^\w\s]\s*)|\s+/", $string, 
        -1, PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE) );

Output:

Array
(
    [0] => Hello
    [1] => ,
    [2] => it's
    [3] => me-me-it's-me
    [4] => !
    [5] => !
    [6] => !
    [7] => o
    [8] => ... 
    [9] => (
    [10] => a
    [11] => friend
    [12] => ?
    [13] => )
)

This is how it works:

  1. (\w\S+\w) Capture any word of 3+ characters, allowing any non-space characters (including punctuation) between the first and last word character.
  2. (\w+) Capture any word (to catch short words).
  3. (\s*\.{3}\s*) Capture ellipsis ..., together with any surrounding space.
  4. (\s*[^\w\s]\s*) Capture any non-word, non-space character individually (\w covers letters, digits and underscore), but attach any nearby spaces.
  5. \s+ Any other spaces (i.e., between words) split the string, but are not captured.

If you want to be selective about what can be inside a word, replace the \S+ in the first alternative with a list of what you want to allow, e.g., [\w'-]+ to allow apostrophes and hyphens only.
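A minimal sketch of that selective variant, run against the string from the question. The tokenize function name is hypothetical and the second sample string is made up; the pattern is the one above with only \S+ replaced by [\w'-]+.

```php
<?php
// Hypothetical helper (not from the answer): the pattern above with \S+
// swapped for [\w'-]+, so only apostrophes and hyphens may sit inside a word.
function tokenize(string $text): array
{
    $pattern = "/(\w[\w'-]+\w)|(\w+)|(\s*\.{3}\s*)|(\s*[^\w\s]\s*)|\s+/";
    $parts = preg_split($pattern, $text, -1,
                        PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
    // preg_split() returns false on a pattern error; fall back to an empty list.
    return $parts === false ? [] : $parts;
}

print_r(tokenize("Hello, Harry Potter. I'm Tom Riddle."));
// "Hello", ", ", "Harry", "Potter", ". ", "I'm", "Tom", "Riddle", "."

print_r(tokenize("wait...what"));
// "wait", "...", "what"
```

With the original \S+ the second string would stay a single token, since the dots count as non-space characters between two word characters. The final "." in the first result has no trailing space because the input ends there.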

alexis answered Oct 27 '22 01:10