
How do I recognise currency symbols in Perl?

I'm having some trouble with this.

I am reading in some text and trying to extract prices from it. That part works, but I am also trying to determine the name of the currency from the symbol in the text, with if statements similar to these:

if ($curr eq "\$") {
    print CURRENCY "Currency: Dollars($curr)\n";
}
elsif ($curr eq "£") {
    print CURRENCY "Currency: Pounds($curr)\n";
}
elsif ($curr eq "€") {
    print CURRENCY "Currency: Euros($curr)\n";
}

Now this works for $ (which has to be escaped obviously), but not for the Pound symbol or the Euro symbol. I assume this is something to do with Unicode encoding or something similar from what my attempts to google the issue brought up but nothing I found was much assistance. I wonder if anyone can help me here!

Drake asked Dec 04 '22 10:12


2 Answers

How to talk about Unicode characters

It sounds like you are having a problem with encodings. You seem to have Unicode characters in your Perl program’s source code. You need to use this pragma (that’s a fancy way of saying a lowercase module name which acts like a compiler directive):

use utf8;

Put that at the top of your program, and then make sure that you are actually editing it with an editor that knows to save it as UTF-8 text. If you have the file command, you can use it to verify that the file really is in UTF-8.
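Here is a small sketch of what the pragma changes, assuming the script itself is saved as UTF-8. Without use utf8, the literal "£" is seen as the two bytes of its UTF-8 encoding, so it can never compare equal to a single decoded character. The sub names are made up for illustration.

```perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8.

# Hypothetical helper: the literal is read as raw bytes.
sub pound_length_as_bytes {
    no utf8;
    return length("£");      # counts the bytes 0xC2 0xA3
}

# Hypothetical helper: the literal is decoded into characters.
sub pound_length_as_chars {
    use utf8;
    return length("£");      # counts the single character U+00A3
}

printf "without utf8: %d, with utf8: %d\n",
    pound_length_as_bytes(), pound_length_as_chars();
```

The pragma is lexically scoped, which is why the two subs can disagree inside one file.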

An alternative that doesn’t require your Perl source to be in UTF-8 is to use code point numbers or Unicode character names instead of literals. To get named Unicode characters, use this pragma:

use charnames qw[ :full ];

Now you can use the "\N{…}" notation to talk about named characters:

$pound_sign = "\N{POUND SIGN}";
$euro_sign  = "\N{EURO SIGN}";

Another way is to use the numeric code point, if you know it:

$pound_sign = chr(163);
$euro_sign  = chr(0x20AC);

You can use the exact number in strings and patterns if you want, too:

if ($text =~ /\xA3/) { … }     # POUND SIGN

if ($text =~ /\x{20AC}/) { … } # EURO SIGN

That will free you from having to put non-ASCII in your Perl source, which is probably a good idea even though literal magic numbers like those probably aren’t. However, you still have to account for your data source being in some encoding or another. I’m going to assume it’s in some Unicode encoding, probably UTF-8. I hope it’s not CESU-8 from Oracle or Java’s “modified UTF-8”.
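Applied to the question, the named-character notation lets you replace the if chain with a lookup table that contains no non-ASCII source text. This is only a sketch: the %currency_name hash and the currency_name sub are hypothetical names, and it assumes $curr has already been decoded into a character string.

```perl
use strict;
use warnings;
use charnames qw[ :full ];

# Hypothetical lookup table; the names mirror the ones in the question.
my %currency_name = (
    "\$"             => "Dollars",
    "\N{POUND SIGN}" => "Pounds",
    "\N{EURO SIGN}"  => "Euros",
);

# Returns the currency name for a decoded symbol, or undef if unknown.
sub currency_name {
    my ($curr) = @_;
    return $currency_name{$curr};
}

# Code points and named characters are interchangeable as keys.
for my $curr ("\$", chr(163), chr(0x20AC)) {
    printf "Currency: %s (U+%04X)\n", currency_name($curr), ord $curr;
}
```

A hash also makes it easy to add more symbols later without touching the control flow.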

The Unicode ‘Currency_Symbol’ Property

The only right way to detect any arbitrary currency symbol that is represented in text by a single Unicode character is by detecting the Unicode currency symbol property, \p{Sc} or \p{Currency_Symbol}.

Those are Unicode properties, which are character classes you can use in regexes.

You’ll want to say something like

if ($curr =~ /^\p{Sc}$/) { ... }
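The same property can pull the symbol and the amount out of running text in one match. The sketch below assumes the text is already a decoded character string; first_price is a hypothetical helper, and the amount pattern is deliberately simplistic (no thousands separators, no trailing symbols).

```perl
use strict;
use warnings;

# Hypothetical helper: returns (symbol, amount) for the first price found,
# or an empty list if there is none.
sub first_price {
    my ($text) = @_;
    return $text =~ /(\p{Sc})\s*(\d+(?:\.\d+)?)/ ? ($1, $2) : ();
}

my $text = "Tea costs \x{20AC}3.50 today";   # already-decoded character string
if (my ($symbol, $amount) = first_price($text)) {
    printf "symbol U+%04X, amount %s\n", ord $symbol, $amount;
}
```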

But for that to work, you have to have read in $curr from an input source in the :utf8 encoding. In your own source, you’d say:

use utf8;

And for a file you open, you’d say one of these:

# put at the top of your file and be done with it
use open qw[ :std :utf8 ];

# or else when opening a new handle
open(my $new_handle, "< :encoding(utf8)", "/path/to/file")
    || die "can't open /path/to/file: $!";

# if handle already opened, then just
binmode($already_opened_handle, ":encoding(utf8)")
     || die "can't binmode: $!";
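To see what the decoding layer buys you, here is a self-contained sketch that reads UTF-8 bytes through an :encoding layer. It uses an in-memory handle instead of a real file so that it runs anywhere, and it spells the encoding UTF-8 (the strict variant) rather than utf8; both are choices made for this example, and first_decoded_line is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: reads the first line from a string of UTF-8 bytes,
# decoding it through an :encoding layer on an in-memory handle.
sub first_decoded_line {
    my ($bytes) = @_;
    open(my $fh, "<:encoding(UTF-8)", \$bytes)
        || die "can't open in-memory handle: $!";
    my $line = <$fh>;
    close($fh);
    chomp $line;
    return $line;
}

my $bytes = "\xE2\x82\xAC" . "5\n";   # UTF-8 bytes for EURO SIGN, then "5"
my $line  = first_decoded_line($bytes);

# After decoding, the first character is a currency symbol.
print "starts with a currency symbol\n" if $line =~ /^\p{Sc}/;
```

Without the layer, the line would start with the byte 0xE2, which is not a currency symbol, and the match would fail.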

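Technically, you should probably use :encoding(utf8) rather than the bare :utf8 layer everywhere except for the use utf8; in your own source file, because the :encoding form checks that the input is well formed, so that you can’t get spoofed. Don’t ask. ☹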

If you’re using a module like CGI.pm or XML::Simple, it should just work — but it depends.

Properties of Currency Symbol characters

Here’s the full deal:

% uniprops -vag € 'POUND SIGN'
U+20AC ‹€› \N{ EURO SIGN }:
    \p{\pS} \p{\p{Sc}}
    \p{All} \p{Any} \p{Assigned} \p{InCurrencySymbols} \p{Common} \p{Zyyy} \p{Currency_Symbol} \p{Sc} \p{S} \p{Gr_Base} \p{Grapheme_Base} \p{Graph}
       \p{GrBase} \p{Print} \p{Symbol}
    \p{Age:2.1} \p{Bidi_Class:ET} \p{Bidi_Class=European_Terminator} \p{Bidi_Class:European_Terminator} \p{Bc=ET} \p{Block:Currency_Symbols}
       \p{Canonical_Combining_Class:0} \p{Canonical_Combining_Class=Not_Reordered} \p{Canonical_Combining_Class:Not_Reordered} \p{Ccc=NR}
       \p{Canonical_Combining_Class:NR} \p{Script=Common} \p{General_Category=Currency_Symbol} \p{Decomposition_Type:None} \p{Dt=None} \p{East_Asian_Width:A}
       \p{East_Asian_Width=Ambiguous} \p{East_Asian_Width:Ambiguous} \p{Ea=A} \p{General_Category:Currency_Symbol} \p{Gc=Sc} \p{General_Category:S}
       \p{General_Category=Symbol} \p{General_Category:Sc} \p{General_Category:Symbol} \p{Gc=S} \p{Grapheme_Cluster_Break:Other} \p{GCB=XX}
       \p{Grapheme_Cluster_Break:XX} \p{Grapheme_Cluster_Break=Other} \p{Hangul_Syllable_Type:NA} \p{Hangul_Syllable_Type=Not_Applicable}
       \p{Hangul_Syllable_Type:Not_Applicable} \p{Hst=NA} \p{Joining_Group:No_Joining_Group} \p{Jg=NoJoiningGroup} \p{Joining_Type:Non_Joining} \p{Jt=U}
       \p{Joining_Type:U} \p{Joining_Type=Non_Joining} \p{Line_Break:PR} \p{Line_Break=Prefix_Numeric} \p{Line_Break:Prefix_Numeric} \p{Lb=PR}
       \p{Numeric_Type:None} \p{Nt=None} \p{Numeric_Value:NaN} \p{Nv=NaN} \p{Present_In:2.1} \p{In=2.1} \p{Present_In:3.0} \p{In=3.0} \p{Present_In:3.1}
       \p{In=3.1} \p{Present_In:3.2} \p{In=3.2} \p{Present_In:4.0} \p{In=4.0} \p{Present_In:4.1} \p{In=4.1} \p{Present_In:5.0} \p{In=5.0} \p{Present_In:5.1}
       \p{In=5.1} \p{Present_In:5.2} \p{In=5.2} \p{Script:Common} \p{Sc=Zyyy} \p{Script:Zyyy} \p{Sentence_Break:Other} \p{SB=XX} \p{Sentence_Break:XX}
       \p{Sentence_Break=Other} \p{Word_Break:Other} \p{WB=XX} \p{Word_Break:XX} \p{Word_Break=Other}
U+00A3 ‹£› \N{ POUND SIGN }:
    \p{\pS} \p{\p{Sc}}
    \p{All} \p{Any} \p{Assigned} \p{InLatin1} \p{Common} \p{Zyyy} \p{Currency_Symbol} \p{Sc} \p{S} \p{Gr_Base} \p{Grapheme_Base} \p{Graph} \p{GrBase}
       \p{Pat_Syn} \p{Pattern_Syntax} \p{PatSyn} \p{Print} \p{Symbol}
    \p{Age:1.1} \p{Bidi_Class:ET} \p{Bidi_Class=European_Terminator} \p{Bidi_Class:European_Terminator} \p{Bc=ET} \p{Block:Latin_1}
       \p{Block=Latin_1_Supplement} \p{Block:Latin_1_Supplement} \p{Blk=Latin1} \p{Canonical_Combining_Class:0} \p{Canonical_Combining_Class=Not_Reordered}
       \p{Canonical_Combining_Class:Not_Reordered} \p{Ccc=NR} \p{Canonical_Combining_Class:NR} \p{Script=Common} \p{General_Category=Currency_Symbol}
       \p{Decomposition_Type:None} \p{Dt=None} \p{East_Asian_Width:Na} \p{East_Asian_Width=Narrow} \p{East_Asian_Width:Narrow} \p{Ea=Na}
       \p{General_Category:Currency_Symbol} \p{Gc=Sc} \p{General_Category:S} \p{General_Category=Symbol} \p{General_Category:Sc} \p{General_Category:Symbol}
       \p{Gc=S} \p{Grapheme_Cluster_Break:Other} \p{GCB=XX} \p{Grapheme_Cluster_Break:XX} \p{Grapheme_Cluster_Break=Other} \p{Hangul_Syllable_Type:NA}
       \p{Hangul_Syllable_Type=Not_Applicable} \p{Hangul_Syllable_Type:Not_Applicable} \p{Hst=NA} \p{Joining_Group:No_Joining_Group} \p{Jg=NoJoiningGroup}
       \p{Joining_Type:Non_Joining} \p{Jt=U} \p{Joining_Type:U} \p{Joining_Type=Non_Joining} \p{Line_Break:PR} \p{Line_Break=Prefix_Numeric}
       \p{Line_Break:Prefix_Numeric} \p{Lb=PR} \p{Numeric_Type:None} \p{Nt=None} \p{Numeric_Value:NaN} \p{Nv=NaN} \p{Present_In:1.1} \p{Age=1.1} \p{In=1.1}
       \p{Present_In:2.0} \p{In=2.0} \p{Present_In:2.1} \p{In=2.1} \p{Present_In:3.0} \p{In=3.0} \p{Present_In:3.1} \p{In=3.1} \p{Present_In:3.2} \p{In=3.2}
       \p{Present_In:4.0} \p{In=4.0} \p{Present_In:4.1} \p{In=4.1} \p{Present_In:5.0} \p{In=5.0} \p{Present_In:5.1} \p{In=5.1} \p{Present_In:5.2} \p{In=5.2}
       \p{Script:Common} \p{Sc=Zyyy} \p{Script:Zyyy} \p{Sentence_Break:Other} \p{SB=XX} \p{Sentence_Break:XX} \p{Sentence_Break=Other} \p{Word_Break:Other}
       \p{WB=XX} \p{Word_Break:XX} \p{Word_Break=Other}

Finding all \p{Sc} characters

And here are all 46 of the Unicode characters with the Sc a.k.a. Currency_Symbol property, current as of Unicode 5.2: (sorry for the formatting issues; I believe it’s due to directionality)

 % unichars -a '\p{Sc}' | wc -l
       46

 % unichars -a '\p{Sc}'
 $      36 000024 DOLLAR SIGN
 ¢     162 0000A2 CENT SIGN
 £     163 0000A3 POUND SIGN
 ¤     164 0000A4 CURRENCY SIGN
 ¥     165 0000A5 YEN SIGN
 ؋    1547 00060B AFGHANI SIGN
 ৲    2546 0009F2 BENGALI RUPEE MARK
 ৳    2547 0009F3 BENGALI RUPEE SIGN
 ৻    2555 0009FB BENGALI GANDA MARK
 ૱    2801 000AF1 GUJARATI RUPEE SIGN
 ௹    3065 000BF9 TAMIL RUPEE SIGN
 ฿    3647 000E3F THAI CURRENCY SYMBOL BAHT
 ៛    6107 0017DB KHMER CURRENCY SYMBOL RIEL
 ₠    8352 0020A0 EURO-CURRENCY SIGN
 ₡    8353 0020A1 COLON SIGN
 ₢    8354 0020A2 CRUZEIRO SIGN
 ₣    8355 0020A3 FRENCH FRANC SIGN
 ₤    8356 0020A4 LIRA SIGN
 ₥    8357 0020A5 MILL SIGN
 ₦    8358 0020A6 NAIRA SIGN
 ₧    8359 0020A7 PESETA SIGN
 ₨    8360 0020A8 RUPEE SIGN
 ₩    8361 0020A9 WON SIGN
 ₪    8362 0020AA NEW SHEQEL SIGN
 ₫    8363 0020AB DONG SIGN
 €    8364 0020AC EURO SIGN
 ₭    8365 0020AD KIP SIGN
 ₮    8366 0020AE TUGRIK SIGN
 ₯    8367 0020AF DRACHMA SIGN
 ₰    8368 0020B0 GERMAN PENNY SIGN
 ₱    8369 0020B1 PESO SIGN
 ₲    8370 0020B2 GUARANI SIGN
 ₳    8371 0020B3 AUSTRAL SIGN
 ₴    8372 0020B4 HRYVNIA SIGN
 ₵    8373 0020B5 CEDI SIGN
 ₶    8374 0020B6 LIVRE TOURNOIS SIGN
 ₷    8375 0020B7 SPESMILO SIGN
 ₸    8376 0020B8 TENGE SIGN
 ꠸   43064 00A838 NORTH INDIC RUPEE MARK
 ﷼   65020 00FDFC RIAL SIGN
 ﹩   65129 00FE69 SMALL DOLLAR SIGN
 ＄   65284 00FF04 FULLWIDTH DOLLAR SIGN
 ￠   65504 00FFE0 FULLWIDTH CENT SIGN
 ￡   65505 00FFE1 FULLWIDTH POUND SIGN
 ￥   65509 00FFE5 FULLWIDTH YEN SIGN
 ￦   65510 00FFE6 FULLWIDTH WON SIGN

Whereas here are the ones in the BMP that weren’t in Unicode 4.1 yet; notice how you can combine properties and negations to pull sets of Unicode characters.

% unichars --bmp '\p{Sc}' '\P{In:4.1}'
 ৻  2555 09FB BENGALI GANDA MARK
 ₶  8374 20B6 LIVRE TOURNOIS SIGN
 ₷  8375 20B7 SPESMILO SIGN
 ₸  8376 20B8 TENGE SIGN
 ꠸ 43064 A838 NORTH INDIC RUPEE MARK

If you don’t have unichars and uniprops on your system, just send me mail, and I’ll send them to you. They’re little tiny utility programs in pure Perl, no extra modules required.
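In the meantime, a rough approximation of the \p{Sc} listing takes only a few lines of core Perl. This is a sketch, not the unichars program itself: currency_code_points is a hypothetical name, and the number of characters it finds depends on the Unicode version bundled with your perl, so it may well exceed 46.

```perl
use strict;
use warnings;
use charnames ();

# Hypothetical stand-in for `unichars '\p{Sc}'`: returns the code points
# up to $max (default: all of Unicode) with the Currency_Symbol property.
sub currency_code_points {
    my ($max) = @_;
    $max = 0x10FFFF unless defined $max;
    no warnings 'utf8';    # quiet noncharacter warnings on older perls
    my @found;
    for my $cp (0 .. $max) {
        next if $cp >= 0xD800 && $cp <= 0xDFFF;    # skip surrogates
        push @found, $cp if chr($cp) =~ /\p{Sc}/;
    }
    return @found;
}

# List the BMP only, in the same column order as the output above
# minus the literal character, so nothing non-ASCII is printed.
for my $cp (currency_code_points(0xFFFF)) {
    printf "%6d %06X %s\n", $cp, $cp, charnames::viacode($cp);
}
```

Call it with no argument to scan every plane; that is slower but also catches currency symbols outside the BMP.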

tchrist answered Dec 24 '22 14:12


Put this at the top of your code:

use utf8;

As described in the documentation, it tells Perl that the source code itself is encoded as UTF-8, so non-ASCII string literals are read as characters rather than bytes.

Ether answered Dec 24 '22 15:12