
How to programmatically count the number of files in an archive using Python

Tags:

python

subprocess

python-2.7

popen

7zip

In the program I maintain it is done as in:

# count the files in the archive
length = 0
command = ur'"%s" l -slt "%s"' % (u'path/to/7z.exe', srcFile)
ins, err = Popen(command, stdout=PIPE, stdin=PIPE,
                 startupinfo=startupinfo).communicate()
ins = StringIO.StringIO(ins)
for line in ins: length += 1
ins.close()

Is this really the only way? I can't find any other command, but it seems odd that I can't simply ask for the number of files.

What about error checking? Would it be enough to modify this to:

proc = Popen(command, stdout=PIPE, stdin=PIPE,
             startupinfo=startupinfo)
out = proc.stdout
# ... count
returncode = proc.wait()
if returncode:
    raise Exception(u'Failed reading number of files from ' + srcFile)

or should I actually parse the output of Popen?

EDIT: I'm interested in 7z, rar, and zip archives (all supported by 7z.exe), but 7z and zip would be enough for starters.


asked Jun 29 '15 20:06

Mr_and_Mrs_D

2 Answers

To count the number of archive members in a zip archive in Python:

#!/usr/bin/env python
import sys
from contextlib import closing
from zipfile import ZipFile

with closing(ZipFile(sys.argv[1])) as archive:
    count = len(archive.infolist())
print(count)

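The zipfile module may use the zlib, bz2, and lzma modules, if available, to decompress archive members. Counting members only reads the archive's directory of entries, so nothing is decompressed here.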
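Note that infolist() returns every member, so the count above includes explicit directory entries if the archive stores them. A minimal sketch that counts only non-directory members, assuming Python 3.6+ (where ZipInfo.is_dir() is available); the helper name count_zip_files and the in-memory archive are made up for illustration:

```python
import io
from zipfile import ZipFile


def count_zip_files(source):
    """Count members of a zip archive that are not directory entries.

    source may be a path or a seekable binary file object.
    """
    with ZipFile(source) as archive:
        return sum(1 for info in archive.infolist() if not info.is_dir())


# Build a small in-memory archive to try it out (contents are made up).
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("docs/", "")                 # explicit directory entry
    archive.writestr("docs/readme.txt", "hello")
    archive.writestr("main.py", "print('hi')")

print(count_zip_files(buffer))  # prints 2
```

Whether directories should count as "files" depends on what the number is used for, for example sizing a progress bar for extraction.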

To count the number of regular files in a tar archive:

#!/usr/bin/env python
import sys
import tarfile

with tarfile.open(sys.argv[1]) as archive:
    count = sum(1 for member in archive if member.isreg())
print(count)

It may support gzip, bz2, and lzma compression, depending on the Python version.

For 7z archives, you would need a third-party module that provides similar functionality.

To get the number of files in an archive using the 7z utility:

import os
import re
import subprocess

def count_files_7z(archive):
    s = subprocess.check_output(["7z", "l", archive], env=dict(os.environ, LC_ALL="C"))
    return int(re.search(br'(\d+)\s+files,\s+\d+\s+folders$', s).group(1))

Here's a version that may use less memory if there are many files in the archive:

import os
import re
from subprocess import Popen, PIPE, CalledProcessError

def count_files_7z(archive):
    command = ["7z", "l", archive]
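    p = Popen(command, stdout=PIPE, bufsize=1, env=dict(os.environ, LC_ALL="C"))
    line = b''  # keeps the last line read; stays empty if there is no output
    with p.stdout:
        for line in p.stdout:
            if line.startswith(b'Error:'): # found error
                error = line + b"".join(p.stdout)
                raise CalledProcessError(p.wait(), command, error)
    returncode = p.wait()
    if returncode != 0:  # fail explicitly; asserts are skipped under python -O
        raise CalledProcessError(returncode, command)
    # the file count is reported in the last line of the listing
    return int(re.search(br'(\d+)\s+files,\s+\d+\s+folders', line).group(1))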

Example:

import sys

try:
    print(count_files_7z(sys.argv[1]))
except CalledProcessError as e:
    getattr(sys.stderr, 'buffer', sys.stderr).write(e.output)
    sys.exit(e.returncode)
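The summary-line parsing can be factored out and tested without running 7z at all. A sketch, assuming the listing ends with a summary such as "75 files, 29 folders" (the wording the regex above expects; other 7-Zip versions may phrase it differently). The name parse_7z_file_count and the sample bytes are hypothetical:

```python
import re

# Matches e.g. b"75 files, 29 folders" anywhere in a line.
_SUMMARY = re.compile(br'(\d+)\s+files,\s+\d+\s+folders')


def parse_7z_file_count(output):
    """Return the file count from the last summary line in the given bytes.

    Raises ValueError if no summary line is found, instead of failing
    with AttributeError on a None match.
    """
    for line in reversed(output.splitlines()):
        match = _SUMMARY.search(line)
        if match:
            return int(match.group(1))
    raise ValueError("no 'N files, M folders' summary found")


# Made-up sample with Windows line endings.
sample = (
    b"2015-06-29 21:14:04 ....A          120               a.txt\r\n"
    b"------------------- ----- ------------ ------------\r\n"
    b"                         3534900       325332  75 files, 29 folders\r\n"
)
print(parse_7z_file_count(sample))  # prints 75
```

Splitting into lines first sidesteps the question of whether `$` matches before a trailing `\r\n`.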

To count the number of lines in the output of a generic subprocess:

from functools import partial
from subprocess import Popen, PIPE, CalledProcessError

p = Popen(command, stdout=PIPE, bufsize=-1)
with p.stdout:
    read_chunk = partial(p.stdout.read, 1 << 15)
    count = sum(chunk.count(b'\n') for chunk in iter(read_chunk, b''))
if p.wait() != 0:
    raise CalledProcessError(p.returncode, command)
print(count)

It supports unlimited output: the data is read in fixed-size chunks, so memory use does not grow with the size of the output.
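To try the chunked counter without any particular tool installed, here is a self-contained sketch that wraps the same logic in a function and uses the current Python interpreter as a stand-in child process. The function name count_output_lines and the stand-in command are assumptions for illustration:

```python
import sys
from functools import partial
from subprocess import Popen, PIPE, CalledProcessError


def count_output_lines(command, chunk_size=1 << 15):
    """Count newline bytes in a command's stdout, reading fixed-size chunks."""
    p = Popen(command, stdout=PIPE, bufsize=-1)
    with p.stdout:
        read_chunk = partial(p.stdout.read, chunk_size)
        count = sum(chunk.count(b'\n') for chunk in iter(read_chunk, b''))
    if p.wait() != 0:
        raise CalledProcessError(p.returncode, command)
    return count


# Stand-in child process: the current interpreter printing 100000 lines.
command = [sys.executable, "-c", "for i in range(100000): print(i)"]
print(count_output_lines(command))  # prints 100000
```

Only one chunk is held in memory at a time, however long the output is.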

Comment: "Could you explain why bufsize=-1 (as opposed to bufsize=1 in your previous answer: stackoverflow.com/a/30984882/281545)?"

bufsize=-1 means "use the default I/O buffer size" instead of bufsize=0 (unbuffered), which is the default on Python 2, so it is a performance boost there. It is already the default on recent Python 3 versions. On some earlier Python 3 versions, where the default was still unbuffered, you might get a short read (lose data) unless you pass bufsize=-1 explicitly.

This answer reads in chunks and therefore the stream is fully buffered for efficiency. The solution you've linked is line-oriented. bufsize=1 means "line buffered". There is minimal difference from bufsize=-1 otherwise.

Comment: "And also, what does read_chunk = partial(p.stdout.read, 1 << 15) buy us?"

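It is equivalent to read_chunk = lambda: p.stdout.read(1<<15), but a partial object is easier to introspect in general: it exposes the wrapped function and its arguments. Together with iter(read_chunk, b''), it is used to implement wc -l in Python efficiently.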
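A small illustration of both points, using an in-memory stream instead of a subprocess pipe (the stream contents and the chunk size of 4 are arbitrary choices for the demo):

```python
import io
from functools import partial

stream = io.BytesIO(b"one\ntwo\nthree\n")

read_chunk = partial(stream.read, 4)  # same call as: lambda: stream.read(4)

# A partial object exposes what it wraps; a lambda does not.
print(read_chunk.func.__name__)  # prints read
print(read_chunk.args)           # prints (4,)

# iter(callable, sentinel) keeps calling read_chunk() until it returns b''.
chunks = list(iter(read_chunk, b''))
line_count = sum(chunk.count(b'\n') for chunk in chunks)
print(chunks)      # prints [b'one\n', b'two\n', b'thre', b'e\n']
print(line_count)  # prints 3
```

Chunk boundaries do not need to line up with line boundaries, since only the newline bytes are counted.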


answered Oct 17 '22 15:10

jfs

Since 7z.exe is already bundled with the app and I want to avoid a third-party library, while still needing to parse rar and 7z archives, I think I will go with:

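import re
from subprocess import Popen, PIPE

# startupinfo and StateError are assumed to be defined elsewhere in the app
regErrMatch = re.compile(u'Error:', re.U).match # needs more testing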
r"""7z list command output is of the form:
   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2015-06-29 21:14:04 ....A       <size>               <filename>
where ....A is the attribute value for normal files, D.... for directories
"""
reFileMatch = re.compile(ur'(\d|:|-|\s)*\.\.\.\.A', re.U).match

def countFilesInArchive(srcArch, listFilePath=None):
    """Count all regular files in srcArch (or only the subset in
    listFilePath)."""
    # https://stackoverflow.com/q/31124670/281545
    command = ur'"%s" l -scsUTF-8 -sccUTF-8 "%s"' % ('compiled/7z.exe', srcArch)
    if listFilePath: command += u' @"%s"' % listFilePath
    proc = Popen(command, stdout=PIPE, startupinfo=startupinfo, bufsize=-1)
    length, errorLine = 0, []
    with proc.stdout as out:
        for line in iter(out.readline, b''):
            line = unicode(line, 'utf8')
            if errorLine or regErrMatch(line):
                errorLine.append(line)
            elif reFileMatch(line):
                length += 1
    returncode = proc.wait()
    if returncode or errorLine: raise StateError(u'%s: Listing failed\n' % srcArch +
        u'7z.exe return value: ' + str(returncode) +
        u'\n' + u'\n'.join([x.strip() for x in errorLine if x.strip()]))
    return length

Error checking follows "Python Popen - wait vs communicate vs CalledProcessError" by @JFSebastien.
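The row-matching idea can be tried on its own, without 7z. A Python 3 sketch with a stricter pattern than the one above, run against a made-up listing; the names FILE_ROW and count_file_rows, the sample rows, and the directory attribute shown are illustrative assumptions:

```python
import re

# Date, time, then an attribute column of exactly "....A",
# following the listing layout described in the code above.
FILE_ROW = re.compile(r'\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}\s+\.\.\.\.A\s')


def count_file_rows(lines):
    """Count listing rows whose attribute column is '....A'."""
    return sum(1 for line in lines if FILE_ROW.match(line))


# Hypothetical listing text; real 7z output has more header lines.
listing = """\
   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2015-06-29 21:14:04 ....A          120           80  a.txt
2015-06-29 21:14:04 D....            0            0  docs
2015-06-29 21:14:05 ....A          300          150  docs/b.txt
------------------- ----- ------------ ------------  ------------------------
                                   420          230  2 files, 1 folders
"""
print(count_file_rows(listing.splitlines()))  # prints 2
```

A row with any other attribute combination would not be counted by this pattern, which is one reason to prefer the summary line when it is available.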

My final(ish) version, based on the accepted answer. Decoding to unicode may not be needed, but I kept it for now as I use it everywhere. I also kept the regex, which I may expand; I have seen things like re.compile(u'^(Error:.+|.+ Data Error?|Sub items Errors:.+)',re.U). I will have to look into check_output and CalledProcessError.

def countFilesInArchive(srcArch, listFilePath=None):
    """Count all regular files in srcArch (or only the subset in
    listFilePath)."""
    command = [exe7z, u'l', u'-scsUTF-8', u'-sccUTF-8', srcArch]
    if listFilePath: command += [u'@%s' % listFilePath]
    proc = Popen(command, stdout=PIPE, stdin=PIPE, # stdin needed if listFilePath
                 startupinfo=startupinfo, bufsize=1)
    errorLine = line = u''
    with proc.stdout as out:
        for line in iter(out.readline, b''): # consider io.TextIOWrapper
            line = unicode(line, 'utf8')
            if regErrMatch(line):
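                # decode the remaining lines too, so non-ASCII bytes don't break the join
                errorLine = line + u''.join(unicode(l, 'utf8') for l in out)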
                break
    returncode = proc.wait()
    msg = u'%s: Listing failed\n' % srcArch.s
    if returncode or errorLine:
        msg += u'7z.exe return value: ' + str(returncode) + u'\n' + errorLine
    elif not line: # should not happen
        msg += u'Empty output'
    else: msg = u''
    if msg: raise StateError(msg) # consider using CalledProcessError
    # number of files is reported in the last line - example:
    #                                3534900       325332  75 files, 29 folders
    return int(re.search(ur'(\d+)\s+files,\s+\d+\s+folders', line).group(1))

Will edit this with my findings.
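The two "consider" notes in the code above (io.TextIOWrapper and CalledProcessError) could look roughly like this in Python 3. This is a sketch under assumptions, not the app's code: the function name count_files_from_listing is made up, Windows-specific startupinfo is left out, and a Python one-liner stands in for 7z.exe so the example runs anywhere.

```python
import io
import re
import sys
from subprocess import Popen, PIPE, CalledProcessError

SUMMARY = re.compile(r'(\d+)\s+files,\s+\d+\s+folders')


def count_files_from_listing(command):
    """Run command, scan its UTF-8 output, return the summary file count.

    Raises CalledProcessError on an 'Error:' line or a nonzero exit status.
    """
    proc = Popen(command, stdout=PIPE)
    last = error = ''
    # TextIOWrapper decodes the byte stream, so no per-line decode is needed.
    with io.TextIOWrapper(proc.stdout, encoding='utf-8') as out:
        for line in out:
            if line.startswith('Error:'):
                error = line + ''.join(out)  # keep the rest of the report
                break
            if line.strip():
                last = line  # remember the last non-blank line
    returncode = proc.wait()
    if returncode or error:
        raise CalledProcessError(returncode, command, output=error)
    match = SUMMARY.search(last)
    if match is None:
        raise ValueError('no summary line in output')
    return int(match.group(1))


# Stand-in for 7z: the current interpreter printing a fake listing.
fake = [sys.executable, '-c',
        "print('2015-06-29 21:14:04 ....A 120 a.txt'); "
        "print('  3534900  325332  75 files, 29 folders')"]
print(count_files_from_listing(fake))  # prints 75
```

Carrying the error text in CalledProcessError.output lets the caller decide how to report it, rather than building the message inside the function.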

answered Oct 17 '22 16:10

Mr_and_Mrs_D
