How to get unique strings when lines use different charsets

I have a file 1.txt

$ cat 1.txt 
page1
рage1

But the two lines are detected as different charsets:

$ head -n1 1.txt | file -i -
/dev/stdin: text/plain; charset=us-ascii

$ head -n2 1.txt | tail -n1 | file -i -
/dev/stdin: text/plain; charset=utf-8

The strings have different charsets. Because of that, I can't collapse them into one unique string with the method I know:

$ cat 1.txt | sort | uniq -c | sort -rn
      1 рage1
      1 page1

So, can you help me find a way to get only unique strings in this situation? P.S. I prefer solutions that use only the Linux command line, bash, or awk, but a solution in another programming language is welcome too.

Update: awk '!a[$0]++' Input_file doesn't work either; it still prints both lines.

Viktor Khilin asked Feb 16 '18 11:02



1 Answer

A cursory examination of what we have here:

$ cat 1.txt
page1
рage1
$ hd 1.txt
00000000  70 61 67 65 31 0a d1 80  61 67 65 31 0a           |page1...age1.|
0000000d

As noted in the comments to the question, that second "рage1" is indeed distinct from the previous "page1" for a reason: that's not a Latin p, it's a Cyrillic р, so a uniqueness filter should call them out as separate unless you normalize the text beforehand.
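
One way to spot such lines before deduplicating is to flag anything that contains bytes outside printable ASCII. This is only a minimal sketch, not something from the original answer: the function name flag_non_ascii is made up for illustration, and it assumes an awk that matches byte by byte under LC_ALL=C (gawk and mawk both do).

```shell
# Hypothetical helper: print "LINENO: line" for every input line that
# contains a byte outside printable ASCII (space through tilde).
# LC_ALL=C makes awk compare raw bytes instead of locale-aware characters.
# Note that tabs and other control bytes are flagged as well.
flag_non_ascii() {
    LC_ALL=C awk '/[^ -~]/ { print NR ": " $0 }' "$@"
}

# Recreate the two-line sample from the question. \321\200 is the octal
# form of the bytes d1 80 (Cyrillic "р") seen in the hex dump.
printf 'page1\n\321\200age1\n' | flag_non_ascii
```

On the sample this reports only line 2, the one that starts with the Cyrillic letter.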

iconv won't do the trick here. uconv (e.g. apt install icu-devtools on Debian/Ubuntu) will get you close, but its transliteration mappings are based on phonetics rather than lookalike characters, so when we transliterate this example, the Cyrillic р becomes a Latin r:

$ uconv -x Cyrillic-Latin 1.txt
page1
rage1

See also these more complex uconv commands, which have similar results.

The ICU uconv man page states:

uconv can also run the specified transliteration on the transcoded data, in which case transliteration will happen as an intermediate step, after the data have been transcoded to Unicode. The transliteration can be either a list of semicolon-separated transliterator names, or an arbitrarily complex set of rules in the ICU transliteration rules format.

This implies that somebody could use the "ICU transliteration rules format" to specify a lookalike character mapping. Of course, at that rate, you could use whatever language you want.
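
As a rough illustration of that idea in awk, the sketch below folds a handful of Cyrillic lookalikes onto Latin letters before checking uniqueness. The mapping table and the function name uniq_lookalikes are assumptions made up for this example, not a standard list; a real mapping would need many more entries (the Unicode confusables data is one possible source). It also assumes that both the script and the input are UTF-8.

```shell
# Hypothetical helper: print each line once, treating a few Cyrillic
# letters as equal to the Latin letters they resemble.
# The map below is illustrative and far from complete.
uniq_lookalikes() {
    awk '
    BEGIN {
        # Keys are Cyrillic letters, values are their Latin lookalikes.
        map["а"] = "a"; map["с"] = "c"; map["е"] = "e"
        map["о"] = "o"; map["р"] = "p"; map["х"] = "x"
    }
    {
        key = $0
        # Replace every mapped Cyrillic letter with its Latin lookalike.
        for (cyr in map) gsub(cyr, map[cyr], key)
        # Print the normalized form the first time it is seen.
        if (!seen[key]++) print key
    }' "$@"
}

# The two-line sample from the question (\321\200 is Cyrillic "р").
printf 'page1\n\321\200age1\n' | uniq_lookalikes
```

For the sample this prints a single page1. The output is the normalized line, so if the original spelling matters, the first raw line seen for each key would have to be stored and printed instead.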

I also tried Perl's Text::Unidecode, but that has its own (similar) issues:

$ perl -Mutf8 -MText::Unidecode -pe '$_ = unidecode($_)' 1.txt
page1
NEURage1

That might work better in some cases, but obviously this isn't one of them.

Adam Katz answered Oct 16 '22 17:10
