This topic has been addressed for text based emoticons at link1, link2, link3. However, I would like to do something slightly different than matching simple emoticons. I'm sorting through tweets that contain the emoticons' icons. The following unicode information contains just such emoticons: pdf.
Using a string with english words that also contains any of these emoticons from the pdf, I would like to be able to compare the number of emoticons to the number of words.
The direction that I was heading down doesn't seem to be the best option and I was looking for some help. As you can see in the script below, I was just planning to do the work from the command line:
$cat <file containing the strings with emoticons> | ./emo.py
emo.py psuedo script:
import re
import sys
for row in sys.stdin:
print row.decode('utf-8').encode("ascii","replace")
#insert regex to find the emoticons
if match:
#do some counting using .split(" ")
#print the counting
The problem that I'm running into is the decoding/encoding. I haven't found a good option for how to encode/decode the string so I can correctly find the icons. An example of the string that I want to search to find the number of words and emoticons is as follows:
"Smiley emoticon rocks! I like you."
The challenge: can you make a script that counts the number of words and emoticons in this string? Notice that the emoticons are both sitting next to the words with no space in between.
My solution includes the emoji
and regex
modules. The regex module supports recognizing grapheme clusters (sequences of Unicode codepoints rendered as a single character), so we can count emojis like π¨βπ©βπ¦βπ¦ once, although it consists of 4 emojis.
import emoji
import regex
def split_count(text):
emoji_counter = 0
data = regex.findall(r'\X', text)
for word in data:
if any(char in emoji.UNICODE_EMOJI for char in word):
emoji_counter += 1
# Remove from the given text the emojis
text = text.replace(word, '')
words_counter = len(text.split())
return emoji_counter, words_counter
Testing:
line = "hello π©πΎβπ emoji hello π¨βπ©βπ¦βπ¦ how are π you todayπ
π½π
π½"
counter = split_count(line)
print("Number of emojis - {}, number of words - {}".format(counter[0], counter[1]))
Output:
Number of emojis - 5, number of words - 7
First, there is no need to encode here at all. You're got a Unicode string, and the re
engine can handle Unicode, so just use it.
A character class can include a range of characters, by specifying the first and last with a hyphen in between. And you can specify Unicode characters that you don't know how to type with \U
escape sequences. So:
import re
s=u"Smiley emoticon rocks!\U0001f600 I like you.\U0001f601"
count = len(re.findall(ru'[\U0001f600-\U0001f650]', s))
Or, if the string is big enough that building up the whole findall
list seems wasteful:
emoticons = re.finditer(ru'[\U0001f600-\U0001f650]', s)
count = sum(1 for _ in emoticons)
Counting words, you can do separately:
wordcount = len(s.split())
If you want to do it all at once, you can use an alternation group:
word_and_emoticon_count = len(re.findall(ru'\w+|[\U0001f600-\U0001f650]', s))
As @strangefeatures points out, Python versions before 3.3 allowed "narrow Unicode" builds. And, for example, most CPython Windows builds are narrow. In narrow builds, characters can only be in the range U+0000
to U+FFFF
. There's no way to search for these characters, but that's OK, because they're don't exist to search for; you can just assume they don't exist if you get an "invalid range" error compiling the regexp.
Except, of course, that there's a good chance that wherever you're getting your actual strings from, they're UTF-16-BE or UTF-16-LE, so the characters do exist, they're just encoded into surrogate pairs. And you want to match those surrogate pairs, right? So you need to translate your search into a surrogate-pair search. That is, convert your high and low code points into surrogate pair code units, then (in Python terms) search for:
(lead == low_lead and lead != high_lead and low_trail <= trail <= DFFF or
lead == high_lead and lead != low_lead and DC00 <= trail <= high_trail or
low_lead < lead < high_lead and DC00 <= trail <= DFFF)
You can leave off the second condition in the last case if you're not worried about accepting bogus UTF-16.
If it's not obvious how that translates into regexp, here's an example for the range [\U0001e050-\U0001fbbf]
in UTF-16-BE:
(\ud838[\udc50-\udfff])|([\ud839-\ud83d].)|(\ud83e[\udc00-\udfbf])
Of course if your range is small enough that low_lead == high_lead
this gets simpler. For example, the original question's range can be searched with:
\ud83d[\ude00-\ude50]
One last trick, if you don't actually know whether you're going to get UTF-16-LE or UTF-16-BE (and the BOM is far away from the data you're searching): Because no surrogate lead or trail code unit is valid as a standalone character or as the other end of a pair, you can just search in both directions:
(\ud838[\udc50-\udfff])|([\ud839-\ud83d][\udc00-\udfff])|(\ud83e[\udc00-\udfbf])|
([\udc50-\udfff]\ud838)|([\udc00-\udfff][\ud839-\ud83d])|([\udc00-\udfbf]\ud83e)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With