
How to check the character count of a file in Python

I have Python code that reads many files, but some of them are extremely large, which causes errors in other code. I want a way to check the character count of each file so that I can avoid reading the extremely large ones. Thanks.

randeepsp asked Jan 06 '10 05:01


1 Answer

os.stat(filepath).st_size

This assumes that by ‘characters’ you mean bytes.

ETA, in reply to the comment "I need the total character count, just like what the command 'wc filename' gives me on Unix":

In which mode? wc on its own will give you a line, word and byte count (the byte count being the same as stat's), not a count of Unicode characters.

There is a switch, -m, which uses the locale's current encoding to decode the bytes into Unicode and then counts code points. Is that really what you want? Decoding into Unicode makes no sense if all you are looking for is files that are too long. If you really must:

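import sys, codecs

def getUnicodeFileLength(filepath, charset=None):
    # Fall back to the filesystem encoding when no charset is given.
    if charset is None:
        charset = sys.getfilesystemencoding()
    # Note the lowercase name: codecs.getreader, not getReader.
    readerclass = codecs.getreader(charset)
    # 'replace' substitutes U+FFFD for undecodable bytes instead of raising.
    reader = readerclass(open(filepath, 'rb'), 'replace')
    nchar = 0
    try:
        while True:
            chars = reader.read(1024*32)  # arbitrary chunk size
            if not chars:  # empty string means end of file
                break
            nchar += len(chars)
    finally:
        reader.close()  # also closes the underlying file
    return nchar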

On Unix, sys.getfilesystemencoding() normally follows the locale encoding, which approximates what wc -m does. If you know the file's encoding yourself (e.g. 'utf-8'), pass that in instead.
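On Python 3 the same count can be sketched with the built-in open(), which decodes as it reads. This is only an illustration: the helper name count_chars and the 32 KiB chunk size are assumptions, not part of the answer above.

```python
import os
import tempfile

def count_chars(filepath, encoding="utf-8", chunk_size=32 * 1024):
    """Count decoded code points in a file, reading in chunks (hypothetical helper)."""
    nchar = 0
    # newline="" disables newline translation, so "\r\n" counts as two characters;
    # errors="replace" mirrors the 'replace' handler used above.
    with open(filepath, "r", encoding=encoding, errors="replace", newline="") as f:
        while True:
            chars = f.read(chunk_size)
            if not chars:  # empty string means end of file
                break
            nchar += len(chars)
    return nchar

# Small demonstration: "\u00e9" (e with acute accent) is one code point
# but two bytes in UTF-8, so the two counts differ.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("caf\u00e9\n".encode("utf-8"))
    demo_path = tmp.name

byte_count = os.stat(demo_path).st_size
char_count = count_chars(demo_path, "utf-8")
os.remove(demo_path)
print(byte_count, char_count)
```

The byte count comes from file metadata alone, while the character count requires reading and decoding the whole file, which is exactly the work you are trying to avoid for oversized files.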

I don't think you want to do this.
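For the original goal of skipping very large files, comparing the byte size against a limit is usually enough. A minimal sketch, where the helper name small_enough and the 10 MB limit are assumptions chosen for illustration:

```python
import os
import tempfile

MAX_BYTES = 10 * 1024 * 1024  # assumed limit; pick whatever suits your data

def small_enough(filepath, max_bytes=MAX_BYTES):
    """Return True if the file's size in bytes does not exceed max_bytes."""
    return os.stat(filepath).st_size <= max_bytes

# Demonstration with a throwaway 100-byte file.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"x" * 100)
    demo_path = tmp.name

fits_default = small_enough(demo_path)   # 100 bytes is under the 10 MB limit
fits_tiny = small_enough(demo_path, 50)  # 100 bytes exceeds a 50-byte limit
os.remove(demo_path)
print(fits_default, fits_tiny)
```

In the code that reads the files, you would call small_enough(path) before open() and skip any path for which it returns False.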

bobince answered Oct 18 '22 17:10