
How many characters look like a space but are not the space character?

Tags:

regex

php

If I want to discover the hexadecimal equivalent of a space in PHP I can play with bin2hex:

php > var_dump(bin2hex(" "));
string(2) "20"

I can also obtain the space character back from "20":

php > var_dump(hex2bin("20"));
string(1) " "

But Unicode also has other characters that look like a space, such as the no-break space (UTF-8 bytes c2 a0):

php > var_dump(hex2bin('c2a0'));
string(2) " "

So I may receive a string (for example, from an HTTP request) that contains a no-break space I cannot tell apart from a regular space by eye. At the moment I replace it like this:

$string = preg_replace('~\x{00a0}~siu', ' ', $string);

Is there a better way to find and replace all "space like" characters in PHP?

sensorario asked Jun 22 '15 12:06



1 Answer

You can make use of a Unicode category \p{Zs}:

Zs    Space separator

$string = preg_replace('~\p{Zs}~u', ' ', $string);
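As a minimal sketch, the one-liner above could be wrapped in a small helper. The name normalize_spaces is hypothetical (it is not a PHP built-in), and the code assumes PHP 7 or later (for the "\u{...}" escape and type declarations) and UTF-8 input:

```php
<?php
// Hypothetical helper: replace every Unicode space separator (category Zs)
// with a plain ASCII space. Assumes $input is valid UTF-8.
function normalize_spaces(string $input): string
{
    $result = preg_replace('~\p{Zs}~u', ' ', $input);

    // With the /u modifier, preg_replace() returns null on failure,
    // for example when the subject is not valid UTF-8.
    if ($result === null) {
        throw new InvalidArgumentException('Input could not be processed as UTF-8');
    }

    return $result;
}

// Sample input with a NO-BREAK SPACE (U+00A0) and an EM SPACE (U+2003).
$raw   = "foo\u{00A0}bar\u{2003}baz";
$clean = normalize_spaces($raw);

// Compare the bytes before and after, since the two strings look alike on screen.
echo bin2hex($raw), "\n";
echo bin2hex($clean), "\n";
```

Checking for null is a design choice here: without it, a string with broken encoding would silently turn into null rather than being passed through unchanged.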

The \p{Zs} Unicode category class will match these space-like symbols:

Character   Name
U+0020      SPACE
U+00A0      NO-BREAK SPACE
U+1680      OGHAM SPACE MARK
U+2000      EN QUAD
U+2001      EM QUAD
U+2002      EN SPACE
U+2003      EM SPACE
U+2004      THREE-PER-EM SPACE
U+2005      FOUR-PER-EM SPACE
U+2006      SIX-PER-EM SPACE
U+2007      FIGURE SPACE
U+2008      PUNCTUATION SPACE
U+2009      THIN SPACE
U+200A      HAIR SPACE
U+202F      NARROW NO-BREAK SPACE
U+205F      MEDIUM MATHEMATICAL SPACE
U+3000      IDEOGRAPHIC SPACE
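
A quick way to see what the table means in practice is to test a few characters against \p{Zs} directly. The sketch below takes three entries from the table and, for comparison, two characters that are not in it: the tab (general category Cc) and the zero width space (general category Cf). The variable names are illustrative only, and PHP 7 or later is assumed for the "\u{...}" escape:

```php
<?php
// A few characters from the table, plus two look-alikes that are NOT in Zs.
$candidates = [
    "\u{0020}" => 'SPACE',
    "\u{00A0}" => 'NO-BREAK SPACE',
    "\u{3000}" => 'IDEOGRAPHIC SPACE',
    "\u{0009}" => 'CHARACTER TABULATION',
    "\u{200B}" => 'ZERO WIDTH SPACE',
];

$isZs = [];
foreach ($candidates as $char => $name) {
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $isZs[$name] = preg_match('~^\p{Zs}$~u', $char) === 1;

    // Print the UTF-8 bytes as hex, because the characters themselves are invisible.
    printf("%-22s bytes=%-6s %s\n", $name, bin2hex($char), $isZs[$name] ? 'Zs' : 'not Zs');
}
```

So \p{Zs} does not cover tabs, line breaks, or zero-width characters. If tabs should be handled too, PCRE's \h (horizontal whitespace) may be worth testing against your input as an alternative.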
Wiktor Stribiżew answered Dec 07 '22 23:12