grep for emojis in linux

I am trying to grep across a list of tokens that includes several non-ASCII characters. I want to match only emojis; other characters such as ð or ñ are fine. The Unicode range for emojis appears to be U+1F600-U+1F1FF, but when I search for it using grep this happens:

grep -P "[\x1F6-\x1F1]" contact_names.tokens
grep: range out of order in character class

https://unicode.org/emoji/charts/full-emoji-list.html#1f3f4_e0067_e0062_e0077_e006c_e0073_e007f

Drivebyluna asked Sep 10 '18 22:09


2 Answers

You need to write each code point in full (1F600, not 1F6) and wrap it in curly braces. In addition, the first value of a range must be smaller than the last one. So the regex should be "[\x{1F1FF}-\x{1F600}]".
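As a minimal sketch of those two fixes, the snippet below runs the corrected class against two made-up tokens. It assumes GNU grep built with PCRE (-P) support and that a UTF-8 locale named C.UTF-8 is available; the sample tokens and the variable name are hypothetical and not taken from the question.

```shell
# Assumption: GNU grep with -P support; C.UTF-8 is assumed to exist on this system.
# Any other UTF-8 locale listed by `locale -a` should work as well.
export LC_ALL=C.UTF-8

# Hypothetical tokens, written as octal UTF-8 bytes so the script is plain ASCII:
#   "señor" (contains U+00F1) and U+1F600 (grinning face).
# Full code points in curly braces, lower bound first.
matches=$(printf 'se\303\261or\n\360\237\230\200\n' | grep -cP '[\x{1F1FF}-\x{1F600}]')

# Number of lines that contain a character inside the range.
echo "$matches"
```

Only the second token lies inside the range, so ñ is left alone. Without a UTF-8 locale, grep -P treats the input as bytes and may reject code points this large.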

The Unicode range for emojis is, however, more complex than you assumed. The page you referred to does not sort characters by code point, and emojis are spread across many blocks. If you want to cover almost all emojis:

grep -P "[\x{1f300}-\x{1f5ff}\x{1f900}-\x{1f9ff}\x{1f600}-\x{1f64f}\x{1f680}-\x{1f6ff}\x{2600}-\x{26ff}\x{2700}-\x{27bf}\x{1f1e6}-\x{1f1ff}\x{1f191}-\x{1f251}\x{1f004}\x{1f0cf}\x{1f170}-\x{1f171}\x{1f17e}-\x{1f17f}\x{1f18e}\x{3030}\x{2b50}\x{2b55}\x{2934}-\x{2935}\x{2b05}-\x{2b07}\x{2b1b}-\x{2b1c}\x{3297}\x{3299}\x{303d}\x{00a9}\x{00ae}\x{2122}\x{23f3}\x{24c2}\x{23e9}-\x{23ef}\x{25b6}\x{23f8}-\x{23fa}]"  contact_names.tokens

(The range is borrowed from Suhail Gupta's answer to a similar question.)

If you need to allow or disallow specific emoji blocks, see the emoji sequence data on unicode.org. The list of emoji on Wikipedia also shows the characters in ordered tables, but it might not include the latest ones.
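One possible way to allow or disallow blocks is to keep each range in its own shell variable and assemble the character class from the ones you want. This is only a sketch: the variable names and sample tokens are hypothetical, the ranges are copied from the long character class above, and it again assumes GNU grep with -P support and an available C.UTF-8 locale.

```shell
# Assumption: GNU grep with -P support and an available C.UTF-8 locale.
export LC_ALL=C.UTF-8

# Hypothetical variable names; each holds one range from the long class.
emoticons='\x{1f600}-\x{1f64f}'
pictographs='\x{1f300}-\x{1f5ff}'
transport='\x{1f680}-\x{1f6ff}'
misc_symbols='\x{2600}-\x{26ff}'

# Leave a variable out of the class to stop matching that block
# (misc_symbols is deliberately omitted here).
pattern="[${emoticons}${pictographs}${transport}]"

# Hypothetical tokens as octal UTF-8 bytes:
#   U+1F600 (grinning face), U+2600 (sun), U+00F1 (ñ).
hits=$(printf '\360\237\230\200\n\342\230\200\n\303\261\n' | grep -cP "$pattern")

# Number of tokens that contain a character from the selected blocks.
echo "$hits"
```

With misc_symbols left out, the sun token is no longer counted; adding ${misc_symbols} back into pattern would include it again.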

rad164 answered Oct 01 '22 12:10


You could use ugrep as a drop-in replacement for grep to do this:

ugrep "[\x{1F1FF}-\x{1F600}]" contact_names.tokens

ugrep matches Unicode patterns by default (disabled with option -U). The regular expression syntax is POSIX ERE compliant, extended with Unicode character classes, lazy quantifiers, and negative patterns to skip unwanted pattern matches to produce more precise results.

ugrep searches UTF-encoded input when a UTF BOM (byte order mark) is present, and ASCII and UTF-8 input when no UTF BOM is present. Option --encoding permits many other file formats to be searched, such as ISO-8859-1, EBCDIC, and code pages 437, 850, 858, 1250 to 1258.

ugrep searches text and binary files and produces hexdumps for binary matches.

The Unicode ranges for emojis cover more than the range U+1F1FF to U+1F600. See the official Unicode 12 publication https://unicode.org/emoji/charts-12.0/full-emoji-list.html
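A quick way to see the gap is to test an emoji that sits just above U+1F600. The sketch below uses grep -P rather than ugrep so that it runs without extra installs; it assumes GNU grep with PCRE support and an available C.UTF-8 locale, and the two sample tokens are hypothetical.

```shell
# Assumption: GNU grep with -P support and an available C.UTF-8 locale.
export LC_ALL=C.UTF-8

# Hypothetical tokens as octal UTF-8 bytes:
#   U+1F600 (grinning face) and U+1F680 (rocket).
found=$(printf '\360\237\230\200\n\360\237\232\200\n' | grep -cP '[\x{1F1FF}-\x{1F600}]')

# Number of tokens matched by the single range.
echo "$found"
```

The rocket at U+1F680 is an emoji but falls outside the single range, so only the first token is counted.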

Dr. Alex RE answered Oct 01 '22 10:10