I am using listadmin to manage many mailman-based mailing lists. I have a long list of subjects and From addresses set up to block spam. Recently I have been receiving smarter spam that uses nice-looking Unicode characters, e.g.:
Subject: Al l the ad ult mov ies you' ve see n a r e nothing c ompari- ng t o our exx xci t i ng compilation of 13' 000 mov ies in HD t hat are a v ailable for y ou now!
or
Subject: HD qua lit y vi d eos an d pho to graph s o f ho t c hic ks are here for u
Now I want to use a smart Perl regex to block that. Piping these subjects to hexdump revealed that many of the characters are FULLWIDTH LATIN SMALL LETTERs. However, \p{FULLWIDTH LATIN SMALL LETTER} doesn't work:
Can't find Unicode property definition "FULLWIDTH LATIN SMALL LETTER"
So the question is: is there a \p{something} to match those fullwidth characters? Alternatively, is there another way to match those characters?
The perlunicode page documents the available Unicode character classes. I found it referenced in perlrebackslash, which documents special character classes and backslash sequences like \p{...} in regexes.
The summary is that all but the most common property classes require a property type and a property value, separated by : or =. However, there does not seem to be any mention of fullwidth characters as a predefined property.
But there is the Block/Blk property, which can take Halfwidth and Fullwidth Forms (U+FF00–U+FFEF) as its value:
/\p{Block=Halfwidth and Fullwidth Forms}/
This matches your input (tested on Perl v5.16.3).
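As a rough sketch of how that pattern could be used as a filter: the helper name has_fullwidth_forms and the sample subjects below are made up for illustration, and the code assumes the subject has already been decoded into Perl characters (a raw MIME-encoded or byte-string header would need decoding first).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of listadmin): returns 1 if the decoded
# subject contains at least one character from the
# Halfwidth and Fullwidth Forms block (U+FF00-U+FFEF), otherwise 0.
sub has_fullwidth_forms {
    my ($subject) = @_;
    return $subject =~ /\p{Block=Halfwidth and Fullwidth Forms}/ ? 1 : 0;
}

# "HD" spelled with FULLWIDTH LATIN CAPITAL LETTER H (U+FF28) and D (U+FF24).
my $spam = "\x{FF28}\x{FF24} quality videos";
# The same subject in plain ASCII.
my $ham  = "HD quality videos";

for my $subject ($spam, $ham) {
    # Prints "block" for the fullwidth subject, then "allow" for the ASCII one.
    print has_fullwidth_forms($subject) ? "block\n" : "allow\n";
}
```

Note that this flags any character in the block, including halfwidth katakana and fullwidth punctuation, so it may be too broad for lists that legitimately carry East Asian text.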
A useful tool for this is uniprops.
$ uniprops U+FF41
U+FF41 ‹ａ› \N{FULLWIDTH LATIN SMALL LETTER A}
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
All Any Alnum Alpha Alphabetic Assigned InHalfwidthAndFullwidthForms
Cased Cased_Letter LC Changes_When_Casemapped CWCM
Changes_When_NFKC_Casefolded CWKCF Changes_When_Titlecased CWT
Changes_When_Uppercased CWU Ll L Gr_Base Grapheme_Base Graph GrBase
Halfwidth_And_Fullwidth_Forms Hex XDigit Hex_Digit ID_Continue IDC
ID_Start IDS Letter L_ Latin Latn Lowercase_Letter Lower Lowercase
Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum
X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word
X_POSIX_XDigit
As you can see, \p{Block=Halfwidth and Fullwidth Forms} can also be written \p{In Halfwidth and Fullwidth Forms}.
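To check that these spellings really are interchangeable, something like the following should do. It is only a sketch: the sub name matching_spellings and the hash labels are invented here, and the Blk= variant is included on the assumption that the short property name accepts the same value as Block=.

```perl
use strict;
use warnings;

# Spellings of the same block property. The Block= and In forms are the
# ones discussed above; Blk= uses the short name of the Block property.
my %spelling = (
    'Block=' => qr/\p{Block=Halfwidth and Fullwidth Forms}/,
    'Blk='   => qr/\p{Blk=Halfwidth and Fullwidth Forms}/,
    'In'     => qr/\p{In Halfwidth and Fullwidth Forms}/,
);

# Hypothetical helper: returns the labels of all spellings that match
# the given single character, in sorted label order.
sub matching_spellings {
    my ($char) = @_;
    my @hits = grep { $char =~ $spelling{$_} } sort keys %spelling;
    return @hits;
}

my @fullwidth = matching_spellings("\x{FF41}");   # FULLWIDTH LATIN SMALL LETTER A
my @ascii     = matching_spellings("a");          # LATIN SMALL LETTER A

print "U+FF41 matched by: @fullwidth\n";
print "U+0061 matched by: @ascii\n";
```

If the spellings are equivalent, the fullwidth letter is matched by all three patterns and the ASCII letter by none.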