
Filter all types of whitespace in PHP

Tags: php, whitespace

I know that there are many types of space (em space, en space, thin space, non-breaking space, etc.), but all the ones I mentioned have HTML entities (at least, PHP's htmlentities() returns something like &nbsp; for them).

But, what about those spaces that have no HTML entities?
Example: [example URL not valid anymore]
Look at the nickname of this account. It has many " " (spaces) at the front, which are visible to us (this doesn't happen with &nbsp;).

I have already tried filtering with regular expressions using \x escapes, and with str_replace() using the space itself as the argument, with no luck at all!

Do you have any suggestion on how to filter ALL types of whitespace?

Nuno asked Jul 12 '10 17:07


2 Answers

By default, \s will not match whitespace characters with code points above 127. To match those, you can use UTF-8-aware sequences instead.
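The difference is easy to see on a single non-ASCII space. This is a minimal sketch, assuming PHP 7.0+ (for the "\u{...}" string escape), a UTF-8 encoded string and the default "C" locale; the variable names are illustrative only.

```php
<?php
// "a", U+2003 Em space (bytes E2 80 83 in UTF-8), "b"
$text = "a\u{2003}b";

// Without the u modifier the pattern works byte by byte, and none of
// the three bytes of the em space counts as whitespace for \s.
$byte_mode = preg_replace('/\s/', '', $text);

// With the u modifier the subject is treated as UTF-8, so \s is tested
// against the em space as a single character.
$utf8_mode = preg_replace('/\s/u', '', $text);

var_dump($byte_mode === $text); // is the em space still there?
var_dump($utf8_mode === 'ab');  // was the em space removed?
```

Under other locales the byte-mode result may differ, which is one more reason to prefer the u modifier for UTF-8 input.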


(Standard disclaimer: I'm skimming the PCRE source code to compile the lists below, so I may have missed a character or typed something incorrectly. Please forgive me.)

\p{Zs} matches:

  • U+0020 Space
  • U+00A0 No-break space
  • U+1680 Ogham space mark
  • U+180E Mongolian vowel separator
  • U+2000 En quad
  • U+2001 Em quad
  • U+2002 En space
  • U+2003 Em space
  • U+2004 Three-per-em space
  • U+2005 Four-per-em space
  • U+2006 Six-per-em space
  • U+2007 Figure space
  • U+2008 Punctuation space
  • U+2009 Thin space
  • U+200A Hair space
  • U+202F Narrow no-break space
  • U+205F Medium mathematical space
  • U+3000 Ideographic space

\h (Horizontal whitespace) matches the same as \p{Zs} above, plus

  • U+0009 Horizontal tab.
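To check the relationship between \p{Zs} and \h described above, something like the following could be used. It is only a sketch, assuming PHP 7.0+ and UTF-8 strings; the variable names are made up for illustration.

```php
<?php
$tab  = "\t";        // U+0009, matched by \h but not in category Zs
$nbsp = "\u{00A0}";  // U+00A0 No-break space, in category Zs

// preg_match() returns 1 on a match and 0 otherwise.
$tab_is_zs  = preg_match('/^\p{Zs}$/u', $tab);
$tab_is_h   = preg_match('/^\h$/u', $tab);
$nbsp_is_zs = preg_match('/^\p{Zs}$/u', $nbsp);
$nbsp_is_h  = preg_match('/^\h$/u', $nbsp);

// Collapse every run of horizontal whitespace into one plain space.
$collapsed = preg_replace('/\h+/u', ' ', "a\t\u{00A0}\u{2009}b");

var_dump($tab_is_zs, $tab_is_h, $nbsp_is_zs, $nbsp_is_h, $collapsed);
```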

Similarly for matching vertical whitespace there are a few options.

\p{Zl} matches U+2028 Line separator.

\p{Zp} matches U+2029 Paragraph separator.

\v (Vertical whitespace) matches \p{Zl}, \p{Zp} and the following

  • U+000A Linefeed
  • U+000B Vertical tab
  • U+000C Formfeed
  • U+000D Carriage return
  • U+0085 Next line
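As an illustration of \v, here is a small sketch that splits a string on any of the vertical whitespace characters listed above. It assumes PHP 7.0+ and a UTF-8 string; the sample text is invented.

```php
<?php
// Parts separated by U+2028, U+0085 (NEL), CR LF and a vertical tab.
$text = "one\u{2028}two\u{0085}three\r\nfour\x0Bfive";

// \v+ treats any run of vertical whitespace as one separator, so the
// two characters of "\r\n" do not produce an empty element.
$lines = preg_split('/\v+/u', $text);

print_r($lines);
```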

Going back to the beginning, in UTF-8 mode (i.e. using the u pattern modifier) \s will match any character that \p{Z} matches (which is anything that \p{Zs}, \p{Zl} and \p{Zp} will match), plus

  • U+0009 Horizontal tab
  • U+000A Linefeed
  • U+000C Formfeed
  • U+000D Carriage return

To cut a long story short (I bet you read all of the above, didn't you?) you might want to use \s but make sure to be in UTF-8 mode like /\s/u. Putting that to some practical use, to filter out those matching whitespace characters from a string you would do something like

$new_string = preg_replace('/\s/u', '', $old_string);

Finally, if you really, really care about the vertical whitespace characters which aren't included in the \s list above (VT and NEL), then you can use the character class [\s\v] to match all 26 of the whitespace characters listed above.
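Wrapped up as a function, the [\s\v] approach might look like the sketch below. strip_all_whitespace() is a hypothetical helper name, not a PHP built-in, and the code assumes PHP 7.0+ with UTF-8 input.

```php
<?php
// Hypothetical helper: removes every character matched by [\s\v] in UTF-8 mode.
function strip_all_whitespace(string $text): string
{
    $result = preg_replace('/[\s\v]+/u', '', $text);

    // With the u modifier preg_replace() returns null when the subject
    // is not valid UTF-8, so report that instead of passing null along.
    if ($result === null) {
        throw new InvalidArgumentException(
            'preg_replace() failed, error code ' . preg_last_error()
        );
    }

    return $result;
}

// Usage: plain space, NBSP, NEL, vertical tab, tab and ideographic space.
echo strip_all_whitespace(" a\u{00A0}\u{0085}\x0B\tb\u{3000}"), "\n";
```

Throwing on invalid UTF-8 is just one choice; returning the input unchanged would also be defensible, depending on where the string comes from.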

salathe answered Oct 01 '22 05:10


They are all plain spaces (returning character code 32) that can be caught with regular expressions or trim().

Try this:

preg_replace("/\s{2,}/", " ", $text);
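
If the leading characters turn out not to be plain spaces after all, a UTF-8-aware variant of the same collapsing idea could be sketched as follows. collapse_whitespace() is a hypothetical helper name, and the code assumes PHP 7.0+ with UTF-8 input.

```php
<?php
// Hypothetical helper: turns each run of whitespace into one plain space.
function collapse_whitespace(string $text): string
{
    // The u modifier makes \s and \v see multi-byte spaces as characters.
    $collapsed = preg_replace('/[\s\v]+/u', ' ', $text);

    // preg_replace() returns null on invalid UTF-8; fall back to the input.
    if ($collapsed === null) {
        return $text;
    }

    // Drop the single plain space that may remain at either end.
    return trim($collapsed, ' ');
}

echo collapse_whitespace("\u{3000}\u{00A0}Some \u{2003} nickname"), "\n";
```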
animuson answered Oct 01 '22 06:10