dplyr filter condition to distinguish between unicode symbol and its unicode representation

Question

I am trying to filter the Symbol column based on whether each value prints in the form \uxxxx.

This is easy to do visually: some values look like $, ¢, £, and others like \u058f, \u060b, \u07fe.

But I cannot figure out how to do it with stringi / dplyr.

library(dplyr)
library(stringi)

df <- structure(list(Character = c("\u0024", "\u00A2", "\u00A3", 
                             "\u00A4", "\u00A5", "\u058F", "\u060B", "\u07FE", "\u07FF", 
                             "\u09F2", "\u09F3", "\u09FB", "\u0AF1", "\u0BF9", "\u0E3F", 
                             "\u17DB", "\u20A0", "\u20A1", "\u20A2", "\u20A3"), 
                     Symbol = c("$", "¢", "£", "¤", "¥", "\u058f", "\u060b", "\u07fe", "\u07ff", 
                                "৲", "৳", "\u09fb", "\u0af1", "\u0bf9", "฿", "៛", "₠", 
                                "₡", "₢", "₣")), row.names = c(NA, 20L), class = "data.frame")

   Character Symbol
1    \u0024      $
2    \u00A2      ¢
3    \u00A3      £
4    \u00A4      ¤
5    \u00A5      ¥
6    \u058F \u058f
7    \u060B \u060b
8    \u07FE \u07fe
9    \u07FF \u07ff
10   \u09F2      ৲
11   \u09F3      ৳
12   \u09FB \u09fb
13   \u0AF1 \u0af1
14   \u0BF9 \u0bf9
15   \u0E3F      ฿
16   \u17DB      ៛
17   \u20A0      ₠
18   \u20A1      ₡
19   \u20A2      ₢
20   \u20A3      ₣

What I've tried

I have tried variations on nchar, but without luck:

df$Symbol %>% nchar
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

df$Symbol %>% stri_unescape_unicode %>% nchar
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

df$Symbol %>% stri_escape_unicode %>% nchar
# [1] 1 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
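
A likely reason these attempts cannot separate the two groups is that they all work on the stored string, not on how it is printed. Once parsed, an escape such as "\u058f" is a one-character string holding a single code point, whether or not the console can draw it. A minimal sketch using only base R (the variable names are illustrative):

```r
# An escape in source code is parsed into the character itself
x <- "\u058f"
nchar(x)                          # one character, however it is displayed

# The stored code point is the same regardless of font or console
code_point <- utf8ToInt(x)
sprintf("U+%04X", code_point)     # "U+058F"
```

So the distinction being asked about belongs to the printing step, which depends on the platform, locale and font rather than on the data.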

Question

How can I filter on the Symbol column for all the rows of the form $, ¢, £ etc (and conversely for rows like \u058f, \u060b, \u07fe)?

Ritchie Sacramento · Accepted Answer

Edit:

The function glyphs_match() from the gdtools package is designed for this. However, it didn't quite return the expected result. I'm using Lucida Console as my font and get the following output from glyphs_match(). One glyph (row 20) isn't rendered, yet the function returns TRUE for it. Perhaps other users can explain why that is the case.

# Backslashes in a Windows path must be doubled inside an R string
df$glyph_match <- gdtools::glyphs_match(df$Symbol, fontfile = "C:\\WINDOWS\\Fonts\\lucon.TTF")
df

   Character   Symbol glyph_match
1    \u0024        $        TRUE
2    \u00A2        ¢        TRUE
3    \u00A3        £        TRUE
4    \u00A4        ¤        TRUE
5    \u00A5        ¥        TRUE
6    \u058F <U+058F>       FALSE
7    \u060B <U+060B>       FALSE
8    \u07FE <U+07FE>       FALSE
9    \u07FF <U+07FF>       FALSE
10   \u09F2 <U+09F2>       FALSE
11   \u09F3 <U+09F3>       FALSE
12   \u09FB <U+09FB>       FALSE
13   \u0AF1 <U+0AF1>       FALSE
14   \u0BF9 <U+0BF9>       FALSE
15   \u0E3F <U+0E3F>       FALSE
16   \u17DB <U+17DB>       FALSE
17   \u20A0 <U+20A0>       FALSE
18   \u20A1        ¢        TRUE
19   \u20A2 <U+20A2>       FALSE
20   \u20A3 <U+20A3>        TRUE

Earlier answer - may only work on Windows:

The result will vary with your font and system. For example, when I run your code, my output doesn't match what you've provided:

df <- structure(list(Character = c("\u0024", "\u00A2", "\u00A3", 
                             "\u00A4", "\u00A5", "\u058F", "\u060B", "\u07FE", "\u07FF", 
                             "\u09F2", "\u09F3", "\u09FB", "\u0AF1", "\u0BF9", "\u0E3F", 
                             "\u17DB", "\u20A0", "\u20A1", "\u20A2", "\u20A3"), 
                     Symbol = c("$", "¢", "£", "¤", "¥", "\u058f", "\u060b", "\u07fe", "\u07ff", 
                                "৲", "৳", "\u09fb", "\u0af1", "\u0bf9", "฿", "៛", "₠", 
                                "₡", "₢", "₣")), row.names = c(NA, 20L), class = "data.frame")

df
   Character   Symbol
1    \u0024        $
2    \u00A2        ¢
3    \u00A3        £
4    \u00A4        ¤
5    \u00A5        ¥
6    \u058F <U+058F>
7    \u060B <U+060B>
8    \u07FE <U+07FE>
9    \u07FF <U+07FF>
10   \u09F2 <U+09F2>
11   \u09F3 <U+09F3>
12   \u09FB <U+09FB>
13   \u0AF1 <U+0AF1>
14   \u0BF9 <U+0BF9>
15   \u0E3F <U+0E3F>
16   \u17DB <U+17DB>
17   \u20A0 <U+20A0>
18   \u20A1        ¢
19   \u20A2 <U+20A2>
20   \u20A3 <U+20A3>

But one crude way of checking whether the glyph is rendered is:

nchar(capture.output(cat(df$Symbol, sep = "\n"))) == 1

[1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18]  TRUE FALSE FALSE

So the glyphs can be filtered by:

library(dplyr)

df %>%
  filter(nchar(capture.output(cat(Symbol, sep = "\n"))) == 1)

  Character Symbol
1   \u0024      $
2   \u00A2      ¢
3   \u00A3      £
4   \u00A4      ¤
5   \u00A5      ¥
6   \u20A1      ¢
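
A possible variant of the same idea, offered as a sketch rather than a tested replacement: base R's encodeString() is documented as escaping strings the way print.default() does, so comparing its result with the input may flag the symbols that would print as escapes. The helper name prints_as_glyph is hypothetical, and the outcome still depends on platform, locale and R version, so it may not reproduce the output above exactly.

```r
# Hypothetical helper: TRUE where the printed form equals the stored string
prints_as_glyph <- function(x) {
  # encodeString() applies the escaping that print.default() would apply.
  # Note: a literal backslash is also escaped, so it counts as "not a glyph".
  encodeString(x) == x
}

# Printable ASCII is left alone; a tab is shown as the two characters \t
prints_as_glyph(c("$", "a", "\t"))

# Assumed usage with the data frame from the question:
# df %>% filter(prints_as_glyph(Symbol))
```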

jared_mamrot · Answer

Use as.character.POSIXt to 'render' the symbols and pad them with spaces. Symbols that print in the form "\uxxxx" are not padded and stay a single character, while all others are padded to a longer string, so you can filter by length:

# To keep 'single char' symbols e.g. "$":
df %>% filter(nchar(as.character.POSIXt(Symbol)) >= 2)

# Or for 'unicode format' symbols e.g. "\u07fe":
df %>% filter(nchar(as.character.POSIXt(Symbol)) == 1)

If one of the 'symbols' is a longer string (e.g. "aaaaaaaaaa₣"), the padding increases and needs to be accounted for, e.g.:

# To keep 'single char' symbols e.g. "$":
df %>% filter(nchar(as.character.POSIXt(Symbol)) >= 11)

# Or for 'unicode format' symbols e.g. "\u07fe":
df %>% filter(nchar(as.character.POSIXt(Symbol)) <= 10)
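
The padding this relies on can be seen with format() directly. This is a sketch under the assumption that as.character.POSIXt() simply forwards to format(), which I believe was the case in R versions before 4.3; newer versions may behave differently or fail on character input. It also points at a caveat: when every symbol prints as a single glyph, nothing is padded and a length test cannot separate anything.

```r
# format() pads each element to the widest printed width in the vector
padded <- format(c("$", "abcd"))
nchar(padded)      # both elements now have the same length

# With only single printable characters there is nothing to pad against,
# so every element keeps length 1 and a length filter tells us nothing
unpadded <- format(c("$", "a"))
nchar(unpadded)
```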

Tags:

r

dplyr

stringr

stringi
