
Find all text files not containing some text string

I'm on Python 2.7.1 and I'm trying to identify all text files that don't contain some text string.

The program seemed to be working at first but whenever I add the text string to a file, it keeps coming up as if it doesn't contain it (false positive). When I check the contents of the text file, the string is clearly present.

The code I tried to write is

def scanFiles2(rdir,sstring,extens,start = '',cSens = False): 
    fList = []
    for fol,fols,fils in os.walk(rdir): 
        fList.extend([os.path.join(rdir,fol,fil) for fil in fils if fil.endswith(extens) and fil.startswith(start)]) 
    if fList: 
        for fil in fList: 
            rFil = open(fil) 
            for line in rFil: 
                if not cSens: 
                    line,sstring = line.lower(), sstring.lower() 
                if sstring in line:
                    fList.remove(fil) 
                    break
            rFil.close() 
    if fList:
        plur = 'files do' if len(fList) > 1 else 'file does'
        print '\nThe following %d %s not contain "%s":\n'%(len(fList),plur,sstring) 
        for fil in fList: 
            print fil 
    else: 
        print 'No files were found that don\'t contain %(sstring)s.'%locals() 
scanFiles2(rdir = r'C:\temp',sstring = '!!syn',extens = '.html', start = '#', cSens = False) 

I guess there's a flaw in the code but I really don't see it.

UPDATE

The code still comes up with many false positives: files that do contain the search string but are identified as not containing it.

Could text encoding be an issue here? I prefixed the search string with U to account for Unicode encoding but it didn't make any difference.

Does Python in some way cache file contents? I don't think so but that could somewhat account for files to still pop up after having been corrected.

Could some kind of malware cause symptoms like these? Seems highly unlikely to me but I'm kinda desperate to get this fixed.

RubenGeert asked Dec 13 '13 06:12



2 Answers

Removing elements from a list while iterating over it causes unexpected results, because the loop skips the element that follows each removal:

For example:

>>> lst = [1,2,4,6,3,8,0,5]
>>> for n in lst:
...     if n % 2 == 0:
...         lst.remove(n)
...
>>> lst
[1, 4, 3, 0, 5]

Workaround: iterate over a copy of the list:

>>> lst = [1,2,4,6,3,8,0,5]
>>> for n in lst[:]:
...     if n % 2 == 0:
...         lst.remove(n)
...
>>> lst
[1, 3, 5]
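
Another option, not part of the original answer, is to leave the list untouched and build a new one with a list comprehension. This is only a minimal sketch; the name odd_only is made up for illustration.

```python
lst = [1, 2, 4, 6, 3, 8, 0, 5]

# Build a new list of the items to keep instead of removing in place.
odd_only = [n for n in lst if n % 2 != 0]

print(odd_only)  # [1, 3, 5]
```

Nothing is removed during iteration here, so no element can be skipped.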

Alternatively, you can append each matching file path to a new list instead of removing entries from the full file list.

Modified version (appending each file that does not contain sstring instead of removing files that do):

import os

def scanFiles2(rdir, sstring, extens, start='', cSens=False):
    if not cSens:
        # This only needs to be done once.
        sstring = sstring.lower()
    fList = []
    for fol, fols, fils in os.walk(rdir): 
        for fil in fils: 
            if not (fil.startswith(start) and fil.endswith(extens)):
                continue
            fil = os.path.join(fol, fil)
            with open(fil) as rFil:
                for line in rFil: 
                    if not cSens: 
                        line = line.lower()
                    if sstring in line:
                        break
                else:
                    fList.append(fil)
    ...
  • list.remove takes O(n) time, while list.append takes O(1). See Time Complexity.
  • Use a with statement where possible, so the file is closed even if reading fails.
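
To check the append approach end to end, here is a self-contained sketch that runs it against a throwaway directory. The function name files_without, the sample file names and their contents are all assumptions made for the demonstration; only the parameter names come from the question.

```python
import os
import shutil
import tempfile


def files_without(rdir, sstring, extens, start='', cSens=False):
    """Return paths under rdir whose contents never contain sstring."""
    if not cSens:
        sstring = sstring.lower()
    missing = []
    for fol, _, fils in os.walk(rdir):
        for fil in fils:
            if not (fil.startswith(start) and fil.endswith(extens)):
                continue
            path = os.path.join(fol, fil)
            with open(path) as rFil:
                # any() stops reading at the first matching line.
                found = any(
                    sstring in (line if cSens else line.lower())
                    for line in rFil
                )
            if not found:
                missing.append(path)
    return missing


# Hypothetical test data: two files contain the marker, one does not.
tmp = tempfile.mkdtemp()
try:
    for name, text in [('#a.html', 'intro\n!!SYN here\n'),
                       ('#b.html', '!!syn\n'),
                       ('#c.html', 'nothing to see\n')]:
        with open(os.path.join(tmp, name), 'w') as fh:
            fh.write(text)
    found_missing = files_without(tmp, '!!syn', '.html', start='#')
    names = sorted(os.path.basename(p) for p in found_missing)
finally:
    shutil.rmtree(tmp)

print(names)  # ['#c.html']
```

Two matching files in a row is the case the remove-while-iterating version can get wrong, since the second one may be skipped and reported as a false positive. Because this sketch never mutates a list while looping over it, both are excluded.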
falsetru answered Oct 18 '22 15:10


Falsetru already showed you why you should not remove items from a list while looping over it. List iterators do not and cannot adjust their counter when a list is shortened, so if the item at index 3 was just processed and you remove it, the next iteration reads index 4, which now holds the item that was previously at index 5. The item that used to sit at index 4 is never examined.
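
The skipping can be made visible by recording which index and value each iteration actually sees. This is a small illustrative sketch; the helper name remove_even_while_iterating is invented for the demonstration, and the input is the list from the other answer.

```python
def remove_even_while_iterating(values):
    """Remove even numbers the unsafe way and log every (index, value) visited."""
    lst = list(values)
    visited = []
    for index, n in enumerate(lst):
        visited.append((index, n))
        if n % 2 == 0:
            # Shortens the list; the iterator's counter is not adjusted.
            lst.remove(n)
    return lst, visited


result, visited = remove_even_while_iterating([1, 2, 4, 6, 3, 8, 0, 5])
print(result)   # [1, 4, 3, 0, 5]
print(visited)  # [(0, 1), (1, 2), (2, 6), (3, 8), (4, 5)]
```

The values 4, 3 and 0 never show up in visited: each one slid into the slot of a removed item and was passed over.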

A list comprehension version using fnmatch.filter() and any(), with a filter lambda for case-insensitive matching:

import fnmatch
import os

def scanFiles2(rdir, sstring, extens, start='', cSens=False):
    # Substring test per line; lower-case both sides when not case sensitive.
    lfilter = (lambda l: sstring in l) if cSens else (lambda l, s=sstring.lower(): s in l.lower())
    ffilter = '{}*{}'.format(start, extens)
    return [os.path.join(root, fname)
            for root, _, files in os.walk(rdir)
            for fname in fnmatch.filter(files, ffilter)
            if not any(lfilter(l) for l in open(os.path.join(root, fname)))]

but perhaps you'd be better off sticking to a more readable loop:

def scanFiles2(rdir, sstring, extens, start='', cSens=False):
    # Substring test per line; lower-case both sides when not case sensitive.
    lfilter = (lambda l: sstring in l) if cSens else (lambda l, s=sstring.lower(): s in l.lower())
    ffilter = '{}*{}'.format(start, extens)
    result = []
    for root, _, files in os.walk(rdir):
        for fname in fnmatch.filter(files, ffilter):
            fname = os.path.join(root, fname)
            with open(fname) as infh:
                if not any(lfilter(l) for l in infh):
                    result.append(fname)
    return result
Martijn Pieters answered Oct 18 '22 15:10