I'm on Python 2.7.1 and I'm trying to identify all text files that don't contain a given text string.
The program seemed to work at first, but whenever I add the text string to a file, that file still comes up as not containing it (a false positive). When I check the contents of the file, the string is clearly present.
The code I tried to write is:

import os

def scanFiles2(rdir, sstring, extens, start='', cSens=False):
    fList = []
    for fol, fols, fils in os.walk(rdir):
        fList.extend([os.path.join(rdir, fol, fil) for fil in fils
                      if fil.endswith(extens) and fil.startswith(start)])
    if fList:
        for fil in fList:
            rFil = open(fil)
            for line in rFil:
                if not cSens:
                    line, sstring = line.lower(), sstring.lower()
                if sstring in line:
                    fList.remove(fil)
                    break
            rFil.close()
    if fList:
        plur = 'files do' if len(fList) > 1 else 'file does'
        print '\nThe following %d %s not contain "%s":\n' % (len(fList), plur, sstring)
        for fil in fList:
            print fil
    else:
        print 'No files were found that don\'t contain %(sstring)s.' % locals()

scanFiles2(rdir=r'C:\temp', sstring='!!syn', extens='.html', start='#', cSens=False)
I guess there's a flaw in the code but I really don't see it.
UPDATE
The code still comes up with many false positives: files that do contain the search string but are identified as not containing it.
Could text encoding be an issue here? I prefixed the search string with u to account for Unicode, but it didn't make any difference.
Does Python in some way cache file contents? I don't think so, but that would somewhat account for files still popping up after having been corrected.
Could some kind of malware cause symptoms like these? Seems highly unlikely to me but I'm kinda desperate to get this fixed.
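If you suspect an encoding issue, you can rule it out by decoding explicitly rather than comparing raw bytes. A minimal sketch (the helper name and sample file are made up for the demo; io.open behaves the same on Python 2.7 and 3 and yields unicode lines):

```python
# Hypothetical helper to rule out an encoding mismatch: decode the file
# explicitly instead of comparing raw bytes.
import io
import os
import tempfile

def contains_string(path, sstring, encoding='utf-8'):
    """True if the decoded file contains sstring, case-insensitively."""
    with io.open(path, encoding=encoding, errors='replace') as fh:
        return any(sstring.lower() in line.lower() for line in fh)

# Throwaway fixture: the file name and contents are invented for this demo.
tmp = tempfile.mkdtemp()
path = os.path.join(tmp, '#demo.html')
with io.open(path, 'w', encoding='utf-8') as fh:
    fh.write(u'some text\nfoo !!SYN bar\n')

found = contains_string(path, u'!!syn')   # case-insensitive match
```

If the string is found this way but not by your original code, encoding (or the list mutation discussed below) is the culprit, not caching or malware.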
As an alternative to Python, you can use the -L option of grep: "-L, --files-without-match: Only the names of files not containing selected lines are written to standard output. Pathnames are listed once per file searched."
Modifying a list while iterating over it causes unexpected results:
For example:
>>> lst = [1,2,4,6,3,8,0,5]
>>> for n in lst:
... if n % 2 == 0:
... lst.remove(n)
...
>>> lst
[1, 4, 3, 0, 5]
Workaround: iterate over a copy of the list:
>>> lst = [1,2,4,6,3,8,0,5]
>>> for n in lst[:]:
... if n % 2 == 0:
... lst.remove(n)
...
>>> lst
[1, 3, 5]
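Building a new list with a comprehension is another way to sidestep the mutate-while-iterating problem entirely:

```python
# Build a new list of the elements to keep instead of removing in place.
lst = [1, 2, 4, 6, 3, 8, 0, 5]
odds = [n for n in lst if n % 2 != 0]
# odds == [1, 3, 5]
```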
Alternatively, you can append valid file paths to a new list instead of removing entries from the full file list.
Modified version (appending files that do not contain sstring instead of removing them):
import os

def scanFiles2(rdir, sstring, extens, start='', cSens=False):
    if not cSens:
        # This only needs to be done once.
        sstring = sstring.lower()
    fList = []
    for fol, fols, fils in os.walk(rdir):
        for fil in fils:
            if not (fil.startswith(start) and fil.endswith(extens)):
                continue
            fil = os.path.join(fol, fil)
            with open(fil) as rFil:
                for line in rFil:
                    if not cSens:
                        line = line.lower()
                    if sstring in line:
                        break
                else:
                    fList.append(fil)
    return fList
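The else clause on the inner for loop is what makes this work: it runs only when the loop completes without hitting break. A tiny standalone illustration:

```python
# Minimal illustration of for/else.
kept = []
for nums in ([1, 3, 5], [1, 2, 3]):
    for n in nums:
        if n % 2 == 0:
            break              # found an even number; the else is skipped
    else:
        kept.append(nums)      # runs only when no break occurred
# kept == [[1, 3, 5]]
```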
Notes:

- list.remove takes O(n) time, while list.append takes O(1). See Time Complexity on the Python wiki.
- Use the with statement if possible.

Falsetru already showed you why you should not remove items from a list while looping over it; list iterators do not and cannot update their position when the list is shortened, so if item 3 was processed but removed, the next iteration fetches item 4, which was previously located at index 5.
List comprehension version using fnmatch.filter() and any(), with a filter lambda for case-insensitive matching:
import fnmatch
import os

def scanFiles2(rdir, sstring, extens, start='', cSens=False):
    lfilter = (lambda l: sstring in l) if cSens else (
        lambda l, s=sstring.lower(): s in l.lower())
    ffilter = '{}*{}'.format(start, extens)
    return [os.path.join(r, fname)
            for r, _, f in os.walk(rdir)
            for fname in fnmatch.filter(f, ffilter)
            if not any(lfilter(l) for l in open(os.path.join(r, fname)))]
but perhaps you'd be better off sticking to a more readable loop:
def scanFiles2(rdir, sstring, extens, start='', cSens=False):
    lfilter = (lambda l: sstring in l) if cSens else (
        lambda l, s=sstring.lower(): s in l.lower())
    ffilter = '{}*{}'.format(start, extens)
    result = []
    for root, _, files in os.walk(rdir):
        for fname in fnmatch.filter(files, ffilter):
            fname = os.path.join(root, fname)
            with open(fname) as infh:
                if not any(lfilter(l) for l in infh):
                    result.append(fname)
    return result
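As a sanity check, the loop version can be exercised against a throwaway directory. The function is reproduced here with the line filter written as a plain containment test so the snippet is self-contained, and the file names are invented for the demo:

```python
import fnmatch
import os
import tempfile

def scanFiles2(rdir, sstring, extens, start='', cSens=False):
    # Line filter: plain substring containment, lowered when case-insensitive.
    lfilter = (lambda l: sstring in l) if cSens else (
        lambda l, s=sstring.lower(): s in l.lower())
    ffilter = '{}*{}'.format(start, extens)
    result = []
    for root, _, files in os.walk(rdir):
        for fname in fnmatch.filter(files, ffilter):
            fname = os.path.join(root, fname)
            with open(fname) as infh:
                if not any(lfilter(l) for l in infh):
                    result.append(fname)
    return result

# Throwaway fixture: one file with the marker, one without.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, '#a.html'), 'w') as f:
    f.write('some text with !!SYN inside\n')
with open(os.path.join(tmp, '#b.html'), 'w') as f:
    f.write('no marker here\n')

missing = scanFiles2(tmp, '!!syn', '.html', start='#')
# Only #b.html should be reported as not containing the string.
```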