Just like this question, I need to convert HTML entities (e.g. &amp;) to UTF-8 (&) while leaving other UTF-8 characters untouched. The difference is that in my case, I need to do this from the bash command line.
I can use a tool like recode and run echo '&amp;' | recode html..utf-8, which converts it to & just fine. However, with UTF-8 characters in the string, as in
echo 'Arabic &amp; ٱلْعَرَبِيَّة' | recode html..utf-8
I get:
Arabic & Ù±ÙÙØ¹ÙØ±ÙØ¨ÙÙÙÙØ©
which, naturally, is not what I need. It should look like this at the end:
Arabic & ٱلْعَرَبِيَّة
Is there a way to do this without a bunch of messy and seemingly endless regex? Thanks
perl one-liner:
$ echo 'Arabic &amp; ٱلْعَرَبِيَّة' | perl -CS -MHTML::Entities -ne 'print decode_entities($_)'
Arabic & ٱلْعَرَبِيَّة
Requires the HTML::Entities module, which is part of the larger HTML::Parser bundle. Install through your OS package manager or favorite CPAN client.
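If installing a Perl module is not an option, the same idea can be sketched with Python's standard library instead. This is a swapped-in alternative, not part of the answer above: it assumes python3 is on the PATH, and the function name decode_entities is a hypothetical helper chosen for illustration. It reads stdin as UTF-8, decodes entities with html.unescape, and writes UTF-8 back out.

```shell
# Hypothetical helper: decode HTML entities on stdin, pass other UTF-8 text through.
# Assumes python3 is installed; html.unescape is part of the Python 3 standard library.
decode_entities() {
  python3 -c '
import html, sys
# Read and write raw bytes so the result does not depend on the locale.
data = sys.stdin.buffer.read().decode("utf-8")
sys.stdout.buffer.write(html.unescape(data).encode("utf-8"))
'
}

echo 'Arabic &amp; ٱلْعَرَبِيَّة' | decode_entities
# prints: Arabic & ٱلْعَرَبِيَّة
```

Working on bytes and decoding explicitly keeps the non-ASCII text intact even when the shell runs under a C or POSIX locale.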
I had a similar problem when trying to recode a Portuguese text using recode. This problem occurs because recode assumes that the input text is encoded in ISO-8859-1 (Latin-1).
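The effect of that assumption can be reproduced without recode. A minimal sketch, assuming iconv is available: the two bytes that encode é in UTF-8 are passed through a Latin-1-to-UTF-8 conversion, so each byte is treated as a separate character, which gives the same kind of garbage as in the question.

```shell
# The UTF-8 encoding of "é" is the two bytes 0xC3 0xA9 (octal 303 251).
# Reading them as ISO-8859-1 maps each byte to its own character
# (U+00C3 and U+00A9), which are then re-encoded as UTF-8.
printf '\303\251\n' | iconv -f ISO-8859-1 -t UTF-8
# prints: Ã©
```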
To solve the problem, I ran recode twice in sequence.
See this example in Portuguese:
echo 'Isto é uma simulação.' | recode --diacritics UTF-8..HTML | recode HTML..UTF-8;
Isto é uma simulação.
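Note that I use --diacritics so that characters like &, <, >, and ' are left alone. It is very important to prevent the & character from being converted to &amp;, because an existing &amp; would otherwise become &amp;amp; in the first pass and come back as &amp; instead of & in the second. The documentation isn't clear, but you can see it in the source code.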
In the first recode command, the letters with diacritics are converted to their corresponding HTML entities:
echo 'Isto é uma simulação.' | recode --diacritics UTF-8..HTML;
Isto &eacute; uma simula&ccedil;&atilde;o.
Note that é was replaced with &eacute; ('e' with acute accent).
The second recode command converts the HTML entities to UTF-8:
echo 'Isto &eacute; uma simula&ccedil;&atilde;o.' | recode HTML..UTF-8;
Isto é uma simulação.
Note that &eacute; was replaced with é.
Your example would look like this:
echo 'Arabic &amp; ٱلْعَرَبِيَّة' | recode --diacritics UTF-8..HTML | recode HTML..UTF-8
Arabic & ٱلْعَرَبِيَّة