How to get uniq strings with different charset

Question

I have a file 1.txt

$ cat 1.txt 
page1
Ñage1

But:

$ head -n1 1.txt | file -i -
/dev/stdin: text/plain; charset=us-ascii

$ head -n2 1.txt | tail -n1 | file -i -
/dev/stdin: text/plain; charset=utf-8

The two strings have different charsets. Because of that, I can't get the unique strings with the method I know:

$ cat 1.txt | sort | uniq -c | sort -rn
      1 Ñage1
      1 page1

So, can you help me find a way to get only the unique strings in my situation? P.S. I prefer solutions that use only the Linux command line, bash, or awk, but if you have a solution in another programming language, I'd like to see it too.

Update: awk '!a[$0]++' Input_file doesn't work either.

Adam Katz · Accepted Answer

A cursory examination of what we have here:

$ cat 1.txt
page1
Ñage1
$ hd 1.txt
00000000  70 61 67 65 31 0a d1 80  61 67 65 31 0a           |page1...age1.|
0000000d

As noted in the comments to the question, that second "Ñage1" is indeed distinct from the previous "page1" for a reason: that's not a Latin p, it's a Cyrillic Ñ, so a uniqueness filter should call them out as separate unless you normalize the text beforehand.

iconv won't do the trick here. uconv (e.g. apt install icu-devtools on Debian/Ubuntu) will get you close, but its transliteration mappings are based on phonetics rather than lookalike characters, so when we transliterate this example, the Cyrillic р becomes a Latin r:

$ uconv -x Cyrillic-Latin 1.txt
page1
rage1

See also these more complex uconv commands, which have similar results.

The ICU uconv man page states:

uconv can also run the specified transliteration on the transcoded data, in which case transliteration will happen as an intermediate step, after the data have been transcoded to Unicode. The transliteration can be either a list of semicolon-separated transliterator names, or an arbitrarily complex set of rules in the ICU transliteration rules format.

This implies that somebody could use the "ICU transliteration rules format" to specify a lookalike character mapping. Of course, at that point, you could use whatever language you want.

I also tried Perl's Text::Unidecode, but that has its own (similar) issues:

$ perl -Mutf8 -MText::Unidecode -pe '$_ = unidecode($_)' 1.txt
page1
NEURage1

That might work better in some cases, but obviously this isn't one of them.

Tags: linux, bash, character-encoding, awk

Viktor Khilin
