 

How to count the number of unique characters in a file?

Given a file in UTF-8, containing characters in various languages, how can I obtain a count of the number of unique characters it contains, while excluding a select number of symbols (e.g.: "!", "@", "#", ".") from this count?

Village asked Mar 24 '12 01:03




2 Answers

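Here’s a bash solution. :) It is really a Perl one-liner run from the shell: -CSD makes Perl read the files as UTF-8, the BEGIN block marks the excluded symbols as already seen in %s, and the main loop increments $c the first time any other character appears. Note that the newline counts as a character here.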

bash$ perl -CSD -ne 'BEGIN { $s{$_}++ for split //, q(!@#.) }
                     $s{$_}++ || $c++ for split //;
                     END { print "$c\n" }' *.utf8
tchrist answered Oct 04 '22 22:10


In Python:

import itertools, codecs

predicate = set('!@#.').__contains__
# Python 3 name; on Python 2 use itertools.ifilterfalse instead.
unique_char_count = len(set(itertools.filterfalse(
                      predicate, itertools.chain.from_iterable(codecs.open(filename, encoding="UTF-8")))))

When you iterate over a file, you get lines. chain.from_iterable joins them together, so iterating over it you get characters. filterfalse (ifilterfalse in Python 2) eliminates the characters that meet the condition, with the condition defined as membership in a set of the disallowed characters.
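To see what each stage produces, here is a small sketch that runs the same pipeline on in-memory lines instead of a file. The sample `lines` list is made up for illustration, and the code assumes Python 3, where the function is spelled `itertools.filterfalse`.

```python
import itertools

# Hypothetical sample standing in for the lines you would get from a file.
# \u00e9 is "é" and \u00f6 is "ö", written as escapes to pin the code points.
lines = ["h\u00e9llo!\n", "w\u00f6rld.\n"]
disallowed = set("!@#.")

# chain.from_iterable flattens the lines into one stream of characters.
chars = list(itertools.chain.from_iterable(lines))

# filterfalse keeps only the characters for which the predicate is false,
# i.e. the ones that are not in the disallowed set.
kept = set(itertools.filterfalse(disallowed.__contains__, chars))

unique_char_count = len(kept)
print(unique_char_count)
```

The trailing newline is part of every line, so it ends up in the count as well. Add "\n" to the disallowed set if you don't want that.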

Without itertools:

import codecs
disallowed = set('!@#.')
unique_char_count = len(set(char for line in codecs.open(filename, encoding="UTF-8") for char in line 
                              if char not in disallowed))

Using set operations:

import codecs
unique = set()
# Add every character of every line to the set.
for line in codecs.open(filename, encoding="UTF-8"):
    unique.update(line)
unique.difference_update('!@#.')
unique_char_count = len(unique)
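If you want something you can drop into a script, the set-operations version can be wrapped in a function along these lines. This is only a sketch: the name `count_unique_chars`, its `excluded` parameter, and the temporary sample file are made up for the example. It assumes Python 3 and uses the built-in `open` in a `with` block so the file gets closed.

```python
import os
import tempfile

def count_unique_chars(path, excluded="!@#.\n"):
    """Count distinct code points in a UTF-8 file, ignoring those in `excluded`."""
    unique = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            unique.update(line)          # add each character of the line
    unique.difference_update(excluded)   # drop the symbols we don't want counted
    return len(unique)

# Write a small throwaway file to try the function on.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write("a!b@\n\u00e9\u00e9#.\n")
    sample_path = tmp.name

result = count_unique_chars(sample_path)
os.remove(sample_path)
print(result)
```

Keep in mind this counts code points, not what a reader would call characters: an "é" stored as "e" plus a combining accent counts as two. If that matters for your data, normalizing each line with unicodedata.normalize("NFC", line) before updating the set may help.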
agf answered Oct 04 '22 22:10