How to determine number of files on a drive with Python?

I have been trying to figure out how to retrieve (quickly) the number of files on a given HFS+ drive with python.

I have been playing with os.statvfs and such, but can't quite get anything (that seems helpful to me).

Any ideas?

Edit: Let me be a bit more specific. =]

I am writing a Time Machine-like wrapper around rsync for various reasons, and would like a very fast estimate (it does not have to be perfect) of the number of files on the drive rsync is going to scan. That way I can watch rsync's progress as it builds its initial file list (if you call it like rsync -ax --progress, or with the -P option) and report a percentage and/or ETA back to the user.

This is completely separate from the actual backup, whose progress is no problem to track. But with the drives I am working on, which hold several million files, the user ends up watching a counter of the number of files climb with no upper bound for several minutes.

I have tried playing with os.statvfs with exactly the method described in one of the answers so far, but the results do not make sense to me.

>>> import os
>>> os.statvfs('/').f_files - os.statvfs('/').f_ffree
64171205L

The more portable way gives me around 1.1 million on this machine, which is the same as every other indicator I have seen on this machine, including rsync running its preparations:

>>> sum(len(filenames) for path, dirnames, filenames in os.walk("/"))
1084224

Note that the first method is instantaneous, while the second one made me come back 15 minutes later to update because it took just that long to run.

Does anyone know of a similar way to get this number, or what is wrong with how I am treating/interpreting the os.statvfs numbers?

asked Feb 22 '09 by Mike Boers


2 Answers

The right answer for your purpose is to live without a progress bar once: store the number rsync came up with, and assume you have the same number of files as last time on each successive backup.

I didn't believe it, but this seems to work on Linux:

os.statvfs('/').f_files - os.statvfs('/').f_ffree

This computes the total number of inodes on the filesystem minus the number of free inodes, i.e. the number in use. It reports figures for the whole filesystem even if you point it at a subdirectory. os.statvfs is implemented on Unix only.

OK, I admit, I didn't actually let the 'slow, correct' way finish before marveling at the fast method. A few drawbacks: .f_files counts every inode, including directories and symlinks, so the number will overshoot the true file count, possibly by a lot. It might work to count the files the slow way once and use that to adjust the result from the 'fast' way?
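That calibration idea could be sketched like this (the calibration factor would come from one slow os.walk count divided by the inodes in use at that time; it is only as good as the drive's contents staying similar between runs):

```python
import os

def fast_estimate(path, calibration=1.0):
    """Estimate file count from used inodes, scaled by a calibration factor.

    calibration = (true count from one slow os.walk) / (inodes in use then),
    computed once and then reused on later runs.
    """
    st = os.statvfs(path)
    inodes_in_use = st.f_files - st.f_ffree
    return int(inodes_in_use * calibration)
```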

The portable way:

import os
files = sum(len(filenames) for path, dirnames, filenames in os.walk("/"))

os.walk returns a 3-tuple (dirpath, dirnames, filenames) for each directory in the filesystem starting at the given path. This will probably take a long time for "/", but you knew that already.

The easy way:

Let's face it, nobody knows or cares how many files they really have, it's a humdrum and nugatory statistic. You can add this cool 'number of files' feature to your program with this code:

import random
num_files = random.randint(69000, 4000000)

Let us know if any of these methods works for you.

See also How do I prevent Python's os.walk from walking across mount points?
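For reference, the slow walk can be kept on a single filesystem (mirroring rsync's -x / --one-file-system) by pruning mount points from the directory list os.walk hands back, e.g.:

```python
import os

def count_files_one_fs(root):
    """Count files under root without crossing into other mounted filesystems."""
    total = 0
    for path, dirnames, filenames in os.walk(root):
        # Modifying dirnames in place prunes those subtrees, so os.walk
        # never descends into directories that are mount points.
        dirnames[:] = [d for d in dirnames
                       if not os.path.ismount(os.path.join(path, d))]
        total += len(filenames)
    return total
```

This works because os.walk (in its default top-down mode) honors in-place edits to the dirnames list.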

answered Oct 19 '22 by joeforker


You could use a number from a previous rsync run. It is quick, portable, and for 10**6 files and any reasonable backup strategy it will give you 1% or better precision.
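A sketch of turning that remembered number into a percentage while watching rsync build its file list (scraping the live counter out of rsync's --progress output is left out here; feed in whatever count you parse):

```python
def progress_percent(files_seen, last_run_total):
    """Estimate scan progress, capped at 99% since the estimate may be low."""
    if last_run_total <= 0:
        return None  # no previous run to compare against
    return min(99, int(100 * files_seen / last_run_total))
```

Capping at 99% keeps the display sane when this run has more files than the last one did.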

answered Oct 19 '22 by jfs