
Matching any Unicode whitespace characters in a string with PHP regex

I want to split a text message into an array at every space. It worked fine until I received this text message. Here are the few lines of code that process the string:

    $str = 'T bw4  05/09/19 07:51 am BW6N 499.803';
    $cleanStr = iconv("UTF-8", "ISO-8859-1", $str);
    $strArr = preg_split('/[\s\t]/', $cleanStr);
    var_dump($strArr);

var_dump yields this result:

array:6 [▼
 0 => "T"
 1 => b"bw4  05/09/19"
 2 => "07:51"
 3 => "am"
 4 => "BW6N"
 5 => "499.803"
]

The item at index 1, 1 => b"bw4  05/09/19", is not correct. I can't figure out what the letter "b" in front of the value means, or why the space(s) between "bw4" and "05/09/19" were not split. Any suggestions on how to split the string more reliably are greatly appreciated. Here is the original string: https://3v4l.org/2L35M and here is a screenshot of the result from my localhost: http://prntscr.com/jjbvny

Guntar asked Oct 19 '25 07:10


2 Answers

To match one or more Unicode whitespace characters, you can use

'~\s+~u'

Your '/[\s\t]/' pattern matches only a single whitespace character. The \t inside the class is redundant, because \s already matches tabs. More importantly, the u modifier is missing, so \s cannot match the U+00A0 characters (non-breaking spaces) that follow bw4.
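To make the effect of the u modifier concrete, here is a minimal sketch. It assumes PHP 7.0+ (for the "\u{...}" escape) and the default C locale; the variable names are only illustrative. The non-breaking space is written as an escape so it does not get lost when copying.

```php
<?php
// U+00A0 (non-breaking space), written as an escape so it stays visible.
$nbsp = "\u{00A0}";
$str  = "bw4{$nbsp}{$nbsp}05/09/19";

// Without u the subject is scanned byte by byte. In UTF-8 the NBSP is the
// byte pair 0xC2 0xA0, and neither byte counts as whitespace in the C locale.
$withoutU = preg_match('/\s/', $str);

// With u the subject is decoded as UTF-8, so \s sees the code point U+00A0.
$withU = preg_match('/\s/u', $str);

var_dump($withoutU); // int(0) under the assumptions above
var_dump($withU);    // int(1)
```

If a locale has been set with setlocale(), the result without u may differ, which is one more reason to rely on the u modifier for UTF-8 input.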

So, use

$str = 'T bw4  05/09/19 07:51 am BW6N 499.803';
$strArr = preg_split('/\s+/u', $str);
print_r($strArr);

The PHP demo yields:

Array
(
    [0] => T
    [1] => bw4
    [2] => 05/09/19
    [3] => 07:51
    [4] => am
    [5] => BW6N
    [6] => 499.803
)
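One detail worth spelling out: the snippet above no longer calls iconv(). With the u modifier the subject has to be valid UTF-8, so converting to ISO-8859-1 first would make preg_split() fail and return false. Below is a small sketch of a wrapper that checks for that case. The function name splitOnUnicodeWhitespace is hypothetical, and the PREG_SPLIT_NO_EMPTY flag is an addition of this example, not part of the answer above.

```php
<?php
// Hypothetical helper (not from the original answer): split UTF-8 text on
// runs of Unicode whitespace and drop empty pieces.
function splitOnUnicodeWhitespace(string $text): array
{
    // u = treat pattern and subject as UTF-8; PREG_SPLIT_NO_EMPTY skips
    // empty strings caused by leading or trailing whitespace.
    $parts = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // With the u modifier, preg_split() returns false when the subject is
    // not valid UTF-8 (for example after converting it to ISO-8859-1).
    if ($parts === false) {
        throw new InvalidArgumentException(
            'preg_split failed, preg_last_error() = ' . preg_last_error()
        );
    }

    return $parts;
}

// The two non-breaking spaces are written as escapes so they stay visible.
$message = "T bw4\u{00A0}\u{00A0}05/09/19 07:51 am BW6N 499.803";
print_r(splitOnUnicodeWhitespace($message));
```

Throwing on failure is a design choice for this sketch; returning an empty array or falling back to a byte-wise split would also be possible, depending on how the caller wants to treat badly encoded input.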
Wiktor Stribiżew answered Oct 20 '25 20:10


I guess your input is not properly encoded. Try:

$cleanStr = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', utf8_encode($str));

This cleans the string for me: https://3v4l.org/d80QS (if it's displayed correctly this time).

Note: The encoding could also be getting damaged somewhere along the way: in your database (is the text stored as UTF-8?), in your web server (is AddDefaultCharset UTF-8 set in Apache's httpd.conf?), in PHP (what is default_charset in your php.ini? Is it "utf-8"?), on the web page itself (for example <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />), or through a BOM (byte order mark) at the beginning of your source file.
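Before changing any of those settings, it can help to look at the raw bytes PHP actually receives. The following is a rough diagnostic sketch; describeBytes is a hypothetical name, and the sample string only stands in for the real message. It also hints at the "b" prefix in the question's dump: that is likely the dumper marking a string that is not valid UTF-8, which is what the string becomes after the iconv() call to ISO-8859-1.

```php
<?php
// Hypothetical helper: report whether a string is valid UTF-8 and show its
// bytes as hex, so hidden characters such as NBSP become visible.
function describeBytes(string $s): string
{
    // An empty pattern with the u modifier only matches if the subject is
    // valid UTF-8; on invalid input preg_match() returns false.
    $isUtf8 = preg_match('//u', $s) === 1;

    return ($isUtf8 ? 'valid UTF-8' : 'not valid UTF-8') . ': ' . bin2hex($s);
}

$utf8   = "bw4\u{00A0}05";                       // NBSP is c2 a0 in UTF-8
$latin1 = iconv('UTF-8', 'ISO-8859-1', $utf8);   // NBSP becomes the single byte a0

echo describeBytes($utf8), "\n";   // valid UTF-8: 627734c2a03035
echo describeBytes($latin1), "\n"; // not valid UTF-8: 627734a03035
```

If the hex dump of the incoming message already shows c2a0 between the tokens, the input is fine as UTF-8 and only the regex needs the u modifier.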

wp78de answered Oct 20 '25 21:10