I'm trying to detect emoji in my PHP code and prevent users from entering them.
The code I have is:
if(preg_match('/\xEE[\x80-\xBF][\x80-\xBF]|\xEF[\x81-\x83][\x80-\xBF]/', $value) > 0)
{
//warning...
}
But it doesn't work for all emoji. Any ideas?
The first step is to include the library itself: include('emoticons/lib/emoji.php'); (the directory where you keep the library may vary). The next step is to set up the variables, which is easy enough. Create your variable $my_emoji_variable and don't change anything except the reference to the Unicode code point, e.g. 0x1f4aa.
This library allows the handling and conversion of Emoji in PHP. For background, you might want to read this first. You can download a zipfile of the latest code, which contains a helpful readme file. If you want to browse the code, it's in a public GitHub repo.
The emoji picker is initialized with a reference to the message box element. The message box element has to carry the HTML5 data attributes data-emojiable="true" and data-emoji-input="unicode". After initialization, the emoji picker control is displayed at the top right corner of the message box.
If you look through the code and notice references like 0x1f4aa, remove the 0x and add U+ to get the Unicode code point, which in this example is U+1F4AA. So to modify our code to use a different emoji, we need to consult a chart that lists all the code points.
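As a rough sketch of that conversion, the snippet below formats an integer such as 0x1f4aa in U+ notation and also encodes it as the UTF-8 character itself. The function names code_point_label and code_point_to_utf8 are made up for illustration and are not part of the library, and the iconv extension is assumed to be available.

```php
<?php
// Hypothetical helpers for illustration; not part of any emoji library.

// Format an integer code point in U+ notation, e.g. 0x1f4aa -> "U+1F4AA".
function code_point_label($codePoint) {
    return sprintf('U+%04X', $codePoint);
}

// Encode an integer code point as a UTF-8 string (assumes ext/iconv).
function code_point_to_utf8($codePoint) {
    // pack('N', ...) produces a 32-bit big-endian integer, i.e. UCS-4BE.
    return iconv('UCS-4BE', 'UTF-8', pack('N', $codePoint));
}

echo code_point_label(0x1f4aa), "\n";            // prints "U+1F4AA"
echo bin2hex(code_point_to_utf8(0x1f4aa)), "\n"; // prints "f09f92aa"
```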
if(preg_match('/\xEE[\x80-\xBF][\x80-\xBF]|\xEF[\x81-\x83][\x80-\xBF]/', $value)
You really want to match Unicode at a character level, rather than trying to keep track of UTF-8 byte sequences. Use the u modifier to treat your UTF-8 string on a character basis.
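To illustrate the difference the u modifier makes, here is a minimal sketch that counts what "." matches in a single emoji, with and without the modifier. The variable names are arbitrary.

```php
<?php
// U+1F4AA written out as its four UTF-8 bytes.
$flexed = "\xF0\x9F\x92\xAA";

// Without /u the subject is treated as bytes, so "." matches one byte at a time.
$byteCount = preg_match_all('/./', $flexed, $unused);

// With /u the subject is treated as UTF-8, so "." matches one whole code point.
$charCount = preg_match_all('/./u', $flexed, $unused);

echo $byteCount, " ", $charCount, "\n"; // prints "4 1"
```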
The emoji are encoded in the block U+1F300–U+1F5FF. However:
many characters from Japanese carriers' ‘emoji’ sets are actually mapped to existing Unicode symbols, eg the card suits, zodiac signs and some arrows. Do you count these symbols as ‘emoji’ now?
there are still systems which don't use the newly-standardised Unicode emoji code points, instead using ad-hoc ranges in the Private Use Area (U+E000–U+F8FF). Each carrier had its own encoding; iOS 4 used the Softbank set. You may wish to block the entire Private Use Area.
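e.g.:
// Build the UTF-8 encoding of a single Unicode code point.
function unichr($i) {
    return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}
// Match the U+1F300–U+1F5FF block and the Private Use Area.
if (preg_match('/['.
    unichr(0x1F300).'-'.unichr(0x1F5FF).
    unichr(0xE000).'-'.unichr(0xF8FF).
']/u', $value)) {
    ...
}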
From wikipedia:
The core emoji set as of Unicode 6.0 consists of 722 characters, of which 114 characters map to sequences of one or more characters in the pre-6.0 Unicode standard, and the remaining 608 characters map to sequences of one or more characters introduced in Unicode 6.0.[4] There is no block specifically set aside for emoji – the new symbols were encoded in seven different blocks (some newly created), and there exists a Unicode data file called EmojiSources.txt that includes mappings to and from the Japanese vendors' legacy character sets.
Here is the mapping file. There are 722 lines in the file, each one representing one of the 722 emoticons.
It seems like this is not an easy thing to do because there is no specific block set aside for emoji. You need to adjust your regex to cover all of the emoji code points.
You could match an individual code point like so (the pattern needs the u modifier):
\x{1F30F}
U+1F30F is the code point for a globe emoji.
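Extending that idea to ranges, a check might look something like the sketch below. The function name contains_emoji_like is made up for illustration, and the list of ranges is only a sample of blocks that contain emoji, not a complete list; it would need to be extended from the mapping file or a code chart.

```php
<?php
// Hypothetical helper; the range list is illustrative, not exhaustive.
function contains_emoji_like($value) {
    $pattern = '/['
        . '\x{1F300}-\x{1F5FF}'  // Miscellaneous Symbols and Pictographs
        . '\x{1F600}-\x{1F64F}'  // Emoticons
        . '\x{1F680}-\x{1F6FF}'  // Transport and Map Symbols
        . '\x{2600}-\x{26FF}'    // Miscellaneous Symbols (card suits, zodiac signs)
        . '\x{2700}-\x{27BF}'    // Dingbats
        . '\x{E000}-\x{F8FF}'    // Private Use Area (legacy carrier emoji)
        . ']/u';
    // Under /u, preg_match returns false for invalid UTF-8; that case is
    // reported as "no match" here, so validate the encoding separately if needed.
    return preg_match($pattern, $value) === 1;
}

var_dump(contains_emoji_like('plain text'));              // bool(false)
var_dump(contains_emoji_like("globe: \xF0\x9F\x8C\x8F")); // bool(true), U+1F30F
```

Whether symbols such as card suits should be rejected is a policy choice; drop the corresponding ranges if they should be allowed.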
Sorry I don't have a full answer for you, but this should get you headed in the right direction.