I have a file 1.txt
$ cat 1.txt
page1
Ñage1
But:
$ head -n1 1.txt | file -i -
/dev/stdin: text/plain; charset=us-ascii
$ head -n2 1.txt | tail -n1 | file -i -
/dev/stdin: text/plain; charset=utf-8
The two lines are detected as different charsets. Because of this, I can't collapse them into one unique string with the method I know:
$ cat 1.txt | sort | uniq -c | sort -rn
1 Ñage1
1 page1
So, can you help me find a way to get only the unique strings in my situation? P.S. I prefer solutions that use only the Linux command line/bash/awk, but if you have a solution in another programming language, I'd like to see it too.
Upd. awk '!a[$0]++' Input_file
doesn't work either.
A cursory examination of what we have here:
$ cat 1.txt
page1
Ñage1
$ hd 1.txt
00000000 70 61 67 65 31 0a d1 80 61 67 65 31 0a |page1...age1.|
0000000d
As noted in the comments to the question, that second "Ñage1" is indeed distinct from the previous "page1" for a reason: that's not a Latin p
, it's a Cyrillic Ñ
, so a uniqueness filter should call them out as separate unless you normalize the text beforehand.
iconv won't do the trick here. uconv (e.g. apt install icu-devtools on Debian/Ubuntu) will get you close, but its transliteration mappings are based on phonetics rather than lookalike characters, so when we transliterate this example, the Cyrillic р becomes a Latin r:
$ uconv -x Cyrillic-Latin 1.txt
page1
rage1
See also these more complex uconv commands, which have similar results.
The ICU uconv man page states:
uconv can also run the specified transliteration on the transcoded data, in which case transliteration will happen as an intermediate step, after the data have been transcoded to Unicode. The transliteration can be either a list of semicolon-separated transliterator names, or an arbitrarily complex set of rules in the ICU transliteration rules format.
This implies that somebody could use the "ICU transliteration rules format" to specify a lookalike character mapping. Of course, at that point, you could use whatever language you want.
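As a rough illustration of such a lookalike mapping without ICU, here is a minimal sketch in plain shell. The function name fold_lookalikes and the five-letter mapping are assumptions made up for this example, not an established list; a real mapping would need many more entries. It works on raw UTF-8 bytes under LC_ALL=C so that it does not depend on the system locale.

```shell
#!/bin/sh
# fold_lookalikes: hypothetical helper that rewrites a few Cyrillic letters
# as their Latin lookalikes. The mapping is illustrative and far from complete.
# Each printf emits the UTF-8 bytes of one Cyrillic letter (octal escapes),
# and LC_ALL=C makes sed treat the input as plain bytes.
fold_lookalikes() {
    LC_ALL=C sed \
        -e "s/$(printf '\321\200')/p/g" \
        -e "s/$(printf '\320\260')/a/g" \
        -e "s/$(printf '\320\265')/e/g" \
        -e "s/$(printf '\320\276')/o/g" \
        -e "s/$(printf '\321\201')/c/g"
}
# The five rules cover, in order: U+0440, U+0430, U+0435, U+043E, U+0441.
```

With the function defined, `fold_lookalikes < 1.txt | sort | uniq -c | sort -rn` should report a single page1 entry with a count of 2 for the file above. Note that the output contains the folded Latin form, so the original spelling of each line is lost.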
I also tried Perl's Text::Unidecode, but that has its own (similar) issues:
$ perl -Mutf8 -MText::Unidecode -pe '$_ = unidecode($_)' 1.txt
page1
NEURage1
That might work better in some cases, but obviously this isn't one of them.