Given a file in UTF-8, containing characters in various languages, how can I obtain a count of the number of unique characters it contains, while excluding a select number of symbols (e.g.: "!", "@", "#", ".") from this count?
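One approach uses a presence array: create an array hash[] with one slot per possible character and fill it with all zeros. For each character in the string, set that character's slot to 1. Then count the number of 1's in hash[]. Let it be C. C is the final answer, the number of unique characters in the string.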
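The presence-array idea can be sketched in Python roughly as follows. This is only an illustration, not code from the original answer: the function name count_unique_presence and the excluded parameter are made up here, and sizing the array to the full Unicode code-point range (0x110000 slots) is an assumption.

```python
def count_unique_presence(text, excluded=""):
    """Count distinct characters in text using a 0/1 presence array."""
    # One slot per Unicode code point (U+0000..U+10FFFF), all zeros to start.
    seen = bytearray(0x110000)
    for ch in text:
        if ch not in excluded:
            seen[ord(ch)] = 1  # mark this character as present
    # C = the number of 1's in the array.
    return sum(seen)


# "h e l o <space> w \u00f6 r d" remain once "!" and "," are excluded.
print(count_unique_presence("hello, w\u00f6rld!", excluded="!,"))  # prints 9
```

In practice a set does the same job with far less memory, which is what the Python answers further down rely on.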
To get the exact character count of a string, pipe it to wc with printf rather than echo: echo appends a trailing newline, so the count comes out one too high. The same happens with cat, or with running wc directly on a file that ends in a newline. Also note that wc -c counts bytes, not characters; for UTF-8 text, use wc -m to count characters.
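Both pitfalls (the trailing newline, and bytes versus characters) are easy to see from Python as well. A small illustration with a made-up sample string; the shell commands in the comments assume a UTF-8 locale:

```python
s = "na\u00efve"                 # "naive" with a diaeresis: 5 characters

chars = len(s)                   # like: printf '%s' "$s" | wc -m
bytes_ = len(s.encode("utf-8"))  # like: printf '%s' "$s" | wc -c
with_newline = len(s + "\n")     # like: echo "$s" | wc -m

print(chars, bytes_, with_newline)  # prints: 5 6 6
```

The accented letter takes two bytes in UTF-8, so the byte count and the character count differ even before the newline is involved.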
Here’s a Perl one-liner you can run from bash. :)
bash$ perl -CSD -ne 'BEGIN { $s{$_}++ for split //, q(!@#.) }
$s{$_}++ || $c++ for split //;
END { print "$c\n" }' *.utf8
In Python:
import itertools
predicate = set('!@#.').__contains__
# Python 3; filename is the path to the UTF-8 file
with open(filename, encoding="UTF-8") as f:
    unique_char_count = len(set(itertools.filterfalse(
        predicate, itertools.chain.from_iterable(f))))
When you iterate over a file, you get lines. chain.from_iterable joins them together, so iterating over the result gives you characters. filterfalse (called ifilterfalse in Python 2) eliminates the characters that meet the condition, with the condition defined as membership in a set of the disallowed characters. Note that newline characters are counted as well unless you add '\n' to that set.
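To see what each step produces, here is a small sketch that runs the same pipeline on an in-memory list of lines instead of a file. The sample lines are made up, and the code uses Python 3's itertools.filterfalse (ifilterfalse in Python 2):

```python
import itertools

lines = ["ab!\n", "b@c\n"]            # stand-in for the lines of a file
predicate = set("!@#.").__contains__  # True for disallowed characters

# chain.from_iterable flattens the lines into one stream of characters.
chars = list(itertools.chain.from_iterable(lines))
# filterfalse keeps only the characters for which predicate is False.
kept = list(itertools.filterfalse(predicate, chars))

print(chars)           # ['a', 'b', '!', '\n', 'b', '@', 'c', '\n']
print(kept)            # ['a', 'b', '\n', 'b', 'c', '\n']
print(len(set(kept)))  # 4
```

The final count is 4 rather than 3 because the newline survives the filter and counts as a character.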
Without itertools:
disallowed = set('!@#.')
# Python 3; filename is the path to the UTF-8 file
with open(filename, encoding="UTF-8") as f:
    unique_char_count = len(set(
        char for line in f for char in line if char not in disallowed))
Using set operations:
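unique = set()
# Python 3; filename is the path to the UTF-8 file
with open(filename, encoding="UTF-8") as f:
    for line in f:
        unique.update(line)  # add every character of the line
unique.difference_update('!@#.')  # drop the excluded symbols
unique_char_count = len(unique)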
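For reuse, the set-based approach might be wrapped in a function along these lines. This is a sketch under a few assumptions: the name count_unique_chars, the ignore_newlines option, and the sample text are not from the original answers, and skipping line terminators by default is a design choice you may not want.

```python
import os
import tempfile


def count_unique_chars(path, excluded="!@#.", ignore_newlines=True):
    """Return the number of distinct characters in a UTF-8 file."""
    unique = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            unique.update(line)
    unique.difference_update(excluded)
    if ignore_newlines:
        unique.difference_update("\r\n")  # drop line terminators
    return len(unique)


# Usage with a throwaway file containing two short lines.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".utf8",
                                 delete=False) as tmp:
    tmp.write("h\u00e9llo!\nw\u00f6rld.\n")
try:
    result = count_unique_chars(tmp.name)
    print(result)  # prints 8
finally:
    os.remove(tmp.name)
```

This counts Unicode code points, so a letter written as a base character plus a combining mark counts as two characters.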