
php find emoji [update existing code]

Tags:

regex

php

unicode

emoji

I'm trying to detect emoji in my PHP code and prevent users from entering them.

The code I have is:

if(preg_match('/\xEE[\x80-\xBF][\x80-\xBF]|\xEF[\x81-\x83][\x80-\xBF]/', $value) > 0)
{
    //warning...
}

But it doesn't work for all emoji. Any ideas?


asked May 12 '12 13:05

Kukosk

2 Answers

if(preg_match('/\xEE[\x80-\xBF][\x80-\xBF]|\xEF[\x81-\x83][\x80-\xBF]/', $value)

You really want to match Unicode at a character level, rather than trying to keep track of UTF-8 byte sequences. Use the u modifier to treat your UTF-8 string on a character basis.
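To illustrate the difference, here is a minimal sketch (the sample string is just an assumption for the demo, not taken from the question) that counts matches of `.` with and without the u modifier:

```php
<?php
// "a" followed by U+1F30F, written out as its four UTF-8 bytes.
$value = "a\xF0\x9F\x8C\x8F";

// Without u, PCRE treats every byte as a separate character.
$bytes = preg_match_all('/./s', $value);

// With u, the string is decoded as UTF-8 and matched per code point.
$chars = preg_match_all('/./su', $value);

echo $bytes, ' ', $chars, "\n"; // prints "5 2"
```

The byte-level count is why hand-written UTF-8 byte ranges get unwieldy so quickly.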

Many of the new emoji are encoded in the block U+1F300–U+1F5FF, though not all of them. However:

- many characters from Japanese carriers' ‘emoji’ sets are actually mapped to existing Unicode symbols, e.g. the card suits, zodiac signs and some arrows. Do you count these symbols as ‘emoji’ now?
- there are still systems which don't use the newly-standardised Unicode emoji code points, instead using ad-hoc ranges in the Private Use Area. Each carrier had its own encodings. iOS 4 used the Softbank set. You may wish to block the entire Private Use Area.

For example:

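function unichr($i) {
    // Convert a code point to its UTF-8 encoding.
    return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}

// The u modifier makes the character class work on code points, not bytes.
if (preg_match('/['.
    unichr(0x1F300).'-'.unichr(0x1F5FF).
    unichr(0xE000).'-'.unichr(0xF8FF).
']/u', $value)) {
    // ...
}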
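With the u modifier you can also write the code points directly in the pattern as `\x{...}` escapes, which avoids the iconv helper. A sketch, assuming the same two ranges as in the snippet above; `containsEmoji` is a hypothetical helper name, not an existing function:

```php
<?php
// Hypothetical helper: true if $value contains a character from
// U+1F300-U+1F5FF or from the Private Use Area U+E000-U+F8FF.
// Assumption: $value is valid UTF-8. With the u modifier preg_match()
// returns false on invalid UTF-8, which this sketch reports as "no match".
function containsEmoji($value) {
    return preg_match('/[\x{1F300}-\x{1F5FF}\x{E000}-\x{F8FF}]/u', $value) === 1;
}

var_dump(containsEmoji('plain text'));             // bool(false)
var_dump(containsEmoji("globe \xF0\x9F\x8C\x8F")); // bool(true), U+1F30F
var_dump(containsEmoji("\xEE\x80\x80"));           // bool(true), U+E000
```

If you decide that other symbols count as emoji too, extra ranges can be appended inside the same character class.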


answered Oct 07 '22 10:10

bobince

From Wikipedia:

The core emoji set as of Unicode 6.0 consists of 722 characters, of which 114 characters map to sequences of one or more characters in the pre-6.0 Unicode standard, and the remaining 608 characters map to sequences of one or more characters introduced in Unicode 6.0.[4] There is no block specifically set aside for emoji – the new symbols were encoded in seven different blocks (some newly created), and there exists a Unicode data file called EmojiSources.txt that includes mappings to and from the Japanese vendors' legacy character sets.

The mapping file is EmojiSources.txt. There are 722 lines in the file, each one representing one of the 722 emoji.

It seems like this is not an easy thing to do, because there is no specific block set aside for emoji. You need to adjust your regex to cover all of the emoji code points.

You could match an individual code point like so (in PHP this needs the u modifier):

\x{1F30F}

U+1F30F is the code point for a globe emoji.
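As a rough sketch of how a list of code points could be turned into one pattern: the three code points below are samples only, and `buildEmojiPattern` is a made-up name. In practice you would fill the array from the mapping file.

```php
<?php
// Hypothetical helper: build a /u character class from integer code points.
function buildEmojiPattern(array $codePoints) {
    $class = '';
    foreach ($codePoints as $cp) {
        // \x{HEX} is the PCRE escape for a single code point.
        $class .= sprintf('\x{%X}', $cp);
    }
    return '/[' . $class . ']/u';
}

// Sample code points only (globe, spade suit, Aries), not the full set.
$pattern = buildEmojiPattern(array(0x1F30F, 0x2660, 0x2648));

echo $pattern, "\n"; // prints /[\x{1F30F}\x{2660}\x{2648}]/u
var_dump(preg_match($pattern, "\xF0\x9F\x8C\x8F") === 1); // bool(true)
var_dump(preg_match($pattern, 'no emoji here') === 1);    // bool(false)
```

A character class only covers single code points, so entries that map to a sequence of several characters would need separate handling.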

Sorry I don't have a full answer for you, but this should get you headed in the right direction.

answered Oct 07 '22 11:10

Michael Frederick
