I need a way to split a string into parts based on punctuation marks and spaces.
By this I mean that every word should become its own array element, and punctuation at the start or end of a word should also become its own array element.
E.g., I need to be able to turn the string Hello, Harry Potter. I'm Tom Riddle.
into
array(
"Hello",
", ",
"Harry",
"Potter",
". ",
"I'm",
"Tom",
"Riddle",
". "
)
So punctuation in the middle of a word (e.g. an apostrophe) should not cause a separation.
**Edit:** To clarify the desired behaviour: I'm, didn't, etc. should remain one word, but hello!, "okay, etc. should be separated from the punctuation mark at the start or end.
Also, the punctuation marks which I want to be included in the search are:
The closest I have found to the result I need is this:
preg_split('/(\s|[\.,\/])/', $string, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
However, the problems with this are:
The punctuation should keep its adjacent space: usually that is ", " (space after), but if it is an opening bracket, it would be " (" (space before).
When I add more punctuation marks (preg_split("/(\s|[\.?!,;:-(){}[]'\"…\/])/", ) I get an error. I'm pretty sure that this error is due to an unescaped character, so I ran that whole thing through preg_quote, which returned \.\?\!,;\:\-\(\)\{\}\[\]'"…, but this still gives the error: Parse error: syntax error, unexpected '…' (T_STRING), expecting ',' or ')' in [...][...] on line 5
My understanding of regex is fairly limited, but after looking at the PHP docs I gather that the code above splits the string at each whitespace character, or every time it encounters a period, comma, or slash. (Correct me if I'm wrong there?) As I understood it, adding the rest of the characters within the square brackets would make it split the string at any of those characters as well(?) Since this isn't working, I suppose I have some fundamental misunderstanding about how this works, so an explanation would be greatly appreciated.
This will do it; however, the output is slightly different because you included ' as a character to split on, so I'm will be split:
$result = preg_split('/(\.\.\.\s?|[-.?!,;:(){}\[\]\'"]\s?)|\s/',
$string, -1, PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);
It might be simplified, but I just included the ellipsis ... with an optional space OR all your other characters with an optional space OR a space.
You need to escape the dots . outside of the character class [], escape the [ and ] inside the character class, and - needs to be escaped or come first or last so as not to denote a range. Obviously you need to escape the quote that you use to contain the pattern, in this case the single '.
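The escaping rules above can be sketched as follows. This is only an illustration, not the exact pattern from this answer: the apostrophe is deliberately left out of the character class so that I'm stays in one piece, the … character and the u modifier are assumptions added so the class can hold a multi-byte character, and splitOnPunctuation is a hypothetical helper name.

```php
<?php
// Sample input from the question, plus a hypothetical one containing "…".
$samples = [
    "Hello, Harry Potter. I'm Tom Riddle.",
    "Wait… (what?)",
];

// Single-quoted PHP string, so backslashes reach the regex engine as written.
// Inside the class: "-" comes first (literal, not a range), "[" and "]" are
// escaped, and "'" is left out so "I'm" is not split.
// Assumption: the "u" modifier is added so PCRE treats "…" as one UTF-8
// character instead of three separate bytes (the file must be saved as UTF-8).
$pattern = '/([-.?!,;:(){}\[\]"…]\s?)|\s/u';

// Hypothetical helper: returns the tokens, or an empty array on a regex error.
function splitOnPunctuation(string $pattern, string $text): array
{
    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false when the pattern fails to compile or run.
    return $parts === false ? [] : $parts;
}

foreach ($samples as $sample) {
    // json_encode() makes the spaces attached to each delimiter visible.
    echo json_encode(splitOnPunctuation($pattern, $sample), JSON_UNESCAPED_UNICODE), "\n";
}
```

Note that this sketch still splits on every listed character wherever it occurs, so a hyphenated word such as me-me would be broken into three pieces.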
You didn't specify whether a space is required on either side of the punctuation and it isn't clear if this "Punctuation mid-word counts as normal punctuation" means it should or shouldn't count.
Do you really want all word-internal punctuation to stay attached? Also it looks like you want to tokenize each punctuation character separately (but attach nearby whitespace), which is most of the work. If you really do, this should do it. Comes with a test string to show it at work.
$string = "Hello, it's me-me-it's-me!!! o... (a friend?)";
print_r( preg_split("/(\w\S+\w)|(\w+)|(\s*\.{3}\s*)|(\s*[^\w\s]\s*)|\s+/", $string,
-1, PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE) );
Output:
Array
(
[0] => Hello
[1] => ,
[2] => it's
[3] => me-me-it's-me
[4] => !
[5] => !
[6] => !
[7] => o
[8] => ...
[9] => (
[10] => a
[11] => friend
[12] => ?
[13] => )
)
This is how it works:
(\w\S+\w) Capture any word of 3+ characters, allowing embedded non-letters.
(\w+) Capture any word (to catch short words).
(\s*\.{3}\s*) Capture ellipsis ..., together with any surrounding space.
(\s*[^\w\s]\s*) Capture any non-letter, non-space characters individually; but attach any nearby spaces.
\s+ Any other spaces (i.e., between words) split the string, but are not captured.
If you want to be selective about what can be inside a word, replace the \S+ in the first alternative with a list of what you want to allow, e.g., [\w'-]+ to allow apostrophes and hyphens only.
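As a rough sketch of that last suggestion, here is the same pattern with \S+ swapped for [\w'-]+, run on the sentence from the question. The tokenize function name is hypothetical, and json_encode is used instead of print_r only because it makes the spaces attached to each punctuation token visible.

```php
<?php
// Hypothetical helper name; the pattern is the one above with \S+ replaced
// by [\w'-]+ in the first alternative, so only apostrophes and hyphens may
// appear inside a word.
function tokenize(string $text): array
{
    // Double-quoted PHP string: \w, \s and \. are not PHP escape sequences,
    // so they are passed through to the regex engine unchanged.
    $pattern = "/(\w[\w'-]+\w)|(\w+)|(\s*\.{3}\s*)|(\s*[^\w\s]\s*)|\s+/";
    $tokens = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
    // preg_split() returns false on a regex error.
    return $tokens === false ? [] : $tokens;
}

// Prints ["Hello",", ","Harry","Potter",". ","I'm","Tom","Riddle","."]
echo json_encode(tokenize("Hello, Harry Potter. I'm Tom Riddle.")), "\n";

// Leading and trailing punctuation is separated, inner apostrophes are kept.
echo json_encode(tokenize('hello! "okay, didn\'t')), "\n";
```

The final period has no attached space here simply because the input string ends right after it.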