How can I substitute Unicode characters with ASCII in Perl?

I can do it in vim like so:

:%s/\%u2013/-/g

How do I do the equivalent in Perl? I thought this would do it, but it doesn't seem to work:

perl -i -pe 's/\x{2013}/-/g' my.dat

stephenmm asked Feb 22 '10 06:02


2 Answers

For a generic solution, Text::Unidecode transliterates pretty much anything that's thrown at it into pure US-ASCII.

So in your case this would work:

perl -C -MText::Unidecode -n -i -e'print unidecode( $_)' unicode_text.txt

The -C is there to make sure the input is read as UTF-8.

It converts this:

l'été est arrivé à peine après aôut
¿España es un paìs muy lindo?
some special chars: » « ® ¼ ¶ – – — Ṉ
Some greek letters: β ÷ Θ ¬ the α and ω (or is it Ω?)
hiragana? みせる です
Здравствуйте
السلام عليكم

into this:

l'ete est arrive a peine apres aout
?Espana es un pais muy lindo?
some special chars: >> << (r) 1/4 P - - -- N
Some greek letters: b / Th ! the a and o (or is it O?)
hiragana? miseru desu
Zdravstvuitie
lslm `lykm

The last one shows the limits of the module, which can't infer the vowels and get as-salaamu `alaykum from the original Arabic. It's still pretty good, I think.

mirod answered Oct 14 '22 21:10


This did the trick for me:

perl -C1 -i -pe 's/–/-/g' my.dat

Note that the first dash in the substitution is the literal \x{2013} character (an en dash) itself.

Leon Timmermans answered Oct 14 '22 23:10