How to count the number of unique characters in a file?

Given a UTF-8 file containing characters from various languages, how can I count the unique characters it contains while excluding a few chosen symbols (e.g. "!", "@", "#", ".") from the count?

603

asked Mar 24 '12 01:03

Village

2 Answers

Here’s a bash solution, or rather a Perl one-liner run from bash. :)

bash$ perl -CSD -ne 'BEGIN { $s{$_}++ for split //, q(!@#.) }
                     $s{$_}++ || $c++ for split //;
                     END { print "$c\n" }' *.utf8
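For readers who do not read Perl, the logic of the one-liner can be sketched in Python: a "seen" table is pre-seeded with the excluded symbols, and the counter goes up only the first time a character is encountered. The name count_unseen and its parameters are made up for this illustration; it is a sketch of the same idea, not something the answer itself provides.

```python
def count_unseen(lines, excluded="!@#."):
    """Count distinct characters in `lines`, ignoring those in `excluded`."""
    # Pre-seed the table so excluded symbols already look "seen"
    # (the role of the BEGIN block in the Perl version).
    seen = {ch: 1 for ch in excluded}
    count = 0
    for line in lines:
        for ch in line:
            # Bump the counter only on the first sighting of a character
            # (the role of `$s{$_}++ || $c++`).
            if ch not in seen:
                count += 1
            seen[ch] = seen.get(ch, 0) + 1
    return count


print(count_unseen(["héllo!\n", "wörld.\n"]))  # 9
```

As in the Perl version, the newline at the end of each line is itself counted as a character.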

answered Oct 04 '22 22:10

tchrist

In Python:

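import itertools

# filename is the path to the UTF-8 file (Python 3).
predicate = set('!@#.').__contains__
with open(filename, encoding="utf-8") as f:
    # filterfalse is spelled ifilterfalse in Python 2.
    unique_char_count = len(set(itertools.filterfalse(
        predicate, itertools.chain.from_iterable(f))))

Iterating over a file yields lines. chain.from_iterable joins them, so iterating over the result yields individual characters. filterfalse (ifilterfalse in Python 2) then drops every character for which the predicate is true, here membership in the set of disallowed characters. Line endings are characters too, so the newline is included in the count unless you add it to the excluded set.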
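To see what each stage contributes, the same pipeline can be run on a small in-memory list of lines instead of a file. This is a minimal sketch for Python 3, where the filtering function is spelled itertools.filterfalse; the sample lines and variable names are invented for the illustration.

```python
import itertools

lines = ["ab!\n", "bc#\n"]  # stand-in for the lines of a file
excluded = set("!@#.")

# Stage 1: flatten the lines into one stream of characters.
chars = list(itertools.chain.from_iterable(lines))
# chars == ['a', 'b', '!', '\n', 'b', 'c', '#', '\n']

# Stage 2: drop characters for which the predicate is true.
kept = list(itertools.filterfalse(excluded.__contains__, chars))
# kept == ['a', 'b', '\n', 'b', 'c', '\n']

# Stage 3: a set collapses duplicates; its size is the answer.
unique_char_count = len(set(kept))
print(unique_char_count)  # 4: 'a', 'b', 'c' and the newline
```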

Without itertools:

disallowed = set('!@#.')
with open(filename, encoding="utf-8") as f:
    # Set comprehension over every character of every line.
    unique_char_count = len({char for line in f for char in line
                             if char not in disallowed})

Using set operations:

unique = set()
with open(filename, encoding="utf-8") as f:
    for line in f:
        unique.update(line)  # add every character of the line
unique.difference_update('!@#.')  # drop the excluded symbols
unique_char_count = len(unique)
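Any of the variants above can be wrapped in a small helper and tried on a throwaway file. The sketch below assumes Python 3; count_unique_chars and its excluded parameter are hypothetical names chosen for this example, not part of the answer.

```python
import os
import tempfile


def count_unique_chars(path, excluded="!@#."):
    """Return the number of distinct characters in a UTF-8 file,
    not counting any character listed in `excluded`."""
    unique = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            unique.update(line)
    unique.difference_update(excluded)
    return len(unique)


# Try it on a small temporary file with mixed scripts.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write("日本語!\nabc.\n")
    sample_path = tmp.name

result = count_unique_chars(sample_path, excluded="!@#.\n")
os.remove(sample_path)  # clean up the temporary file
print(result)  # 6: three kanji plus 'a', 'b', 'c'
```

Passing "\n" in excluded keeps line endings out of the total; leave it out to count the newline as a character.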

answered Oct 04 '22 22:10

agf

Tags: python, bash, ruby, perl