grep for emojis in linux

Question

I am trying to grep through a list of tokens that includes several non-ASCII characters. I want to match only emojis; other characters such as ð or ñ are fine. The Unicode range for emojis appears to be U+1F600-U+1F1FF, but when I search for it using grep, this happens:

grep -P "[\x1F6-\x1F1]" contact_names.tokens
grep: range out of order in character class

https://unicode.org/emoji/charts/full-emoji-list.html#1f3f4_e0067_e0062_e0077_e006c_e0073_e007f

rad164 · Accepted Answer

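You need to specify the code points with their full value (not 1F6 but 1F600) and wrap them in curly braces. Without braces, \x reads at most two hex digits, so \x1F6 means the control character U+001F followed by a literal 6, and the class then contains the backwards range from 6 to \x1F, which is what triggers the error. In addition, the first value of a range must be smaller than the last value. So the regex should be "[\x{1F1FF}-\x{1F600}]".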
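The ordering requirement can be checked numerically before writing the character class. The following is a minimal sketch; the helper name range_in_order is made up for illustration, and it relies only on shell arithmetic accepting hexadecimal constants.

```shell
# Hypothetical helper: succeed when hex code point LO does not exceed HI.
range_in_order() {
    [ $((0x$1 <= 0x$2)) -eq 1 ]
}

# Endpoints in the order given in the question (U+1F600 first): runs backwards.
range_in_order 1F600 1F1FF || echo "out of order"

# Endpoints swapped as in the regex above: valid for a character class.
range_in_order 1F1FF 1F600 && echo "in order"
```

This only compares the two endpoints; it says nothing about whether the characters in between are actually emojis.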

The Unicode range for emojis is, however, more complex than you assumed. The page you referred to does not sort characters by code point, and emojis are spread across many blocks. If you want to cover almost all emojis:

grep -P "[\x{1f300}-\x{1f5ff}\x{1f900}-\x{1f9ff}\x{1f600}-\x{1f64f}\x{1f680}-\x{1f6ff}\x{2600}-\x{26ff}\x{2700}-\x{27bf}\x{1f1e6}-\x{1f1ff}\x{1f191}-\x{1f251}\x{1f004}\x{1f0cf}\x{1f170}-\x{1f171}\x{1f17e}-\x{1f17f}\x{1f18e}\x{3030}\x{2b50}\x{2b55}\x{2934}-\x{2935}\x{2b05}-\x{2b07}\x{2b1b}-\x{2b1c}\x{3297}\x{3299}\x{303d}\x{00a9}\x{00ae}\x{2122}\x{23f3}\x{24c2}\x{23e9}-\x{23ef}\x{25b6}\x{23f8}-\x{23fa}]"  contact_names.tokens

(The range is borrowed from Suhail Gupta's answer on a similar question)

If you need to allow or disallow specific emoji blocks, see the emoji sequence data on unicode.org. The list of emoji on Wikipedia also shows characters in ordered tables, but it might not list the latest ones.
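To try the approach on a small input before running it on the real file, a sketch along these lines may help. It uses only a few of the blocks from the long class above, not the full list. The names emoji_class and emoji_lines and the sample data are made up for illustration. It also assumes GNU grep built with -P support and an installed C.UTF-8 locale; any UTF-8 locale should work, since grep -P only treats input as UTF-8 in such a locale.

```shell
# A subset of the blocks from the long class above (illustration only).
emoji_class='[\x{1f300}-\x{1f5ff}\x{1f600}-\x{1f64f}\x{1f680}-\x{1f6ff}\x{2600}-\x{26ff}]'

# Hypothetical helper: print lines containing a character from emoji_class.
# Reads the files given as arguments, or standard input when there are none.
emoji_lines() {
    # Assumption: the C.UTF-8 locale exists on this system.
    LC_ALL=C.UTF-8 grep -P "$emoji_class" "$@"
}

# Made-up sample data: two lines with emojis, three without.
sample=$(mktemp)
printf 'alice\nbob 😀\nseñor\nðan\nsunny ☀\n' > "$sample"
emoji_lines "$sample"
rm -f "$sample"
```

With the sample data, only the lines containing 😀 and ☀ should be printed, while the lines with ñ and ð are left out. Adding -v to the grep call would invert this and keep only the emoji-free lines.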

Dr. Alex RE · Answer

You could use ugrep as a drop-in replacement for grep to do this:

ugrep "[\x{1F1FF}-\x{1F600}]" contact_names.tokens

ugrep matches Unicode patterns by default (disabled with option -U). The regular expression syntax is POSIX ERE compliant, extended with Unicode character classes, lazy quantifiers, and negative patterns to skip unwanted pattern matches to produce more precise results.

ugrep searches UTF-encoded input when a UTF BOM (byte order mark) is present, and ASCII and UTF-8 input when no UTF BOM is present. Option --encoding permits many other file formats to be searched, such as ISO-8859-1, EBCDIC, and code pages 437, 850, 858, 1250 to 1258.

ugrep searches text and binary files and produces hexdumps for binary matches.

The Unicode range for emojis is larger than the range U+1F1FF to U+1F600. See the official Unicode 12 publication: https://unicode.org/emoji/charts-12.0/full-emoji-list.html
