
Match special kind of whitespace

Tags: regex, php

I have a string (an empty paragraph) saved from heavily edited and post-processed TinyMCE input.

This is how it looks after echo, in the HTML source in the browser:

<p> </p>

Now, I need to remove those empty paragraphs.

I have already tried

$output = str_ireplace("<p> </p>", "", $string);
$output = preg_replace("/<p> <\/p>/", "", $string);
$output = preg_replace("/<p>[ \t\n\r]*<\/p>/", "", $string);
$output = preg_replace("/<p>[\s]*<\/p>/", "", $string);

and many more variations, with no luck. It's still there, intact. I have also tried mb_ereg_replace and matching the &nbsp; entity, but the string apparently doesn't contain that entity.

On the other hand, this works:

$output = preg_replace("/<p>.*<\/p>/", "", $string);

but of course it also strips paragraphs with actual content.

What else could that "space-like" character be? How am I supposed to match it?

SOLVED: Thanks to ibizaman and the linked thread, I've found the character. It is a non-breaking space, Unicode code point U+00A0 (decimal 160). See http://unicodelookup.com/#160/1

This works:

$output = preg_replace("/<p>[\x{00A0}\s]*<\/p>/u", "", $string);
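
Why the earlier attempts failed can be shown with a small sketch. It assumes the input is UTF-8 encoded, so the non-breaking space is stored as the two bytes 0xC2 0xA0; the sample string and the variable names $withoutU and $withU are made up for illustration.

```php
<?php
// Assumption: UTF-8 input. U+00A0 (NO-BREAK SPACE) is the byte pair 0xC2 0xA0.
$string = "<p>\xC2\xA0</p><p>Real content</p>";

// Without the /u modifier the pattern is matched byte by byte, and neither
// 0xC2 nor 0xA0 is a space, tab, newline, or carriage return.
$withoutU = preg_replace("/<p>[ \t\n\r]*<\/p>/", "", $string);

// With /u the subject and pattern are treated as UTF-8, so \x{00A0}
// refers to the single character U+00A0.
$withU = preg_replace("/<p>[\x{00A0}\s]*<\/p>/u", "", $string);

var_dump($withoutU === $string); // bool(true): the empty paragraph survives
var_dump($withU);                // string(19) "<p>Real content</p>"
```

If the input is not valid UTF-8, preg_replace with /u returns null instead of a string, so it may be worth checking the return value.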

As pointed out by mcrumley, this might work even better:

"/<p>[\p{Zs}\s]*<\/p>/iu"

Saix asked Nov 20 '13 13:11




2 Answers

You can use the Unicode character property to match all spaces. \p{Zs} is "Space separator" and includes space, non-breaking space, thin space, etc. You can also use \pZ to match all separators, including line separator and paragraph separator. See http://www.php.net/manual/en/regexp.reference.unicode.php for details.

$output = preg_replace("/<p>[\p{Zs}\s]*<\/p>/iu", "", $string);
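
As a rough illustration of what \p{Zs} adds, the sketch below runs the same pattern over several kinds of "empty" paragraph. The sample input is hypothetical, and the "\u{...}" string escapes assume PHP 7 or later.

```php
<?php
// Hypothetical input: each "empty" paragraph holds a different space character.
$string = "<p>\u{00A0}</p>"     // NO-BREAK SPACE
        . "<p>\u{2009}</p>"     // THIN SPACE
        . "<P> \u{3000}\n</P>"  // plain space, IDEOGRAPHIC SPACE, newline
        . "<p>kept</p>";

// \p{Zs} covers the Unicode "space separator" category, \s adds tabs and
// line breaks, /i accepts <P> as well as <p>, and /u enables UTF-8 mode.
$output = preg_replace("/<p>[\p{Zs}\s]*<\/p>/iu", "", $string);

var_dump($output); // string(11) "<p>kept</p>"
```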

mcrumley answered Oct 12 '22 03:10


Since you don't know which character is being output, first inspect $string with functions that print its Unicode code points (see this SO question).
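
One possible way to do that inspection, sketched with built-in functions only: bin2hex shows the raw bytes, and json_encode escapes non-ASCII characters as \uXXXX by default. The sample string is an assumption standing in for the real TinyMCE output.

```php
<?php
// Assumption: this mimics the problematic paragraph (UTF-8 encoded U+00A0).
$string = "<p>\xC2\xA0</p>";

// Raw bytes as hexadecimal: anything other than 20, 09, 0a, or 0d between
// the tags is not a plain ASCII whitespace character.
$hex = bin2hex($string);

// json_encode escapes non-ASCII characters as \uXXXX unless told otherwise,
// which reveals the code point directly.
$escaped = json_encode($string, JSON_UNESCAPED_SLASHES);

echo $hex, "\n";     // 3c703ec2a03c2f703e
echo $escaped, "\n"; // "<p>\u00a0</p>"
```

Here the bytes c2 a0 and the escape \u00a0 both point to U+00A0, the non-breaking space.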

Or, you can go the other way around and remove every paragraph that contains no alphanumeric characters, keeping only paragraphs with real content:

$output = preg_replace("/<p>[^a-zA-Z0-9]*<\/p>/", "", $string);

Disclaimer: I already posted this in the comments, but since it solved the problem, I think it's better placed in an answer for future reference.

ibizaman answered Oct 12 '22 01:10